Preface to the First Edition


We first came to focus on what is now known as reinforcement learning in late 1979. We were both 
at the University of Massachusetts, working on one of the earliest projects to revive the idea that 
networks of neuronlike adaptive elements might prove to be a promising approach to artificial adaptive 
intelligence. The project explored the “heterostatic theory of adaptive systems” developed by A. Harry
Klopf. Harry’s work was a rich source of ideas, and we were permitted to explore them critically and 
compare them with the long history of prior work in adaptive systems. Our task became one of teasing 
the ideas apart and understanding their relationships and relative importance. This continues today, 
but in 1979 we came to realize that perhaps the simplest of the ideas, which had long been taken for 
granted, had received surprisingly little attention from a computational perspective. This was simply 
the idea of a learning system that wants something, that adapts its behavior in order to maximize a 
special signal from its environment. This was the idea of a “hedonistic” learning system, or, as we 
would say now, the idea of reinforcement learning. 

Like others, we had a sense that reinforcement learning had been thoroughly explored in the early 
days of cybernetics and artificial intelligence. On closer inspection, though, we found that it had 
been explored only slightly. While reinforcement learning had clearly motivated some of the earliest 
computational studies of learning, most of these researchers had gone on to other things, such as 
pattern classification, supervised learning, and adaptive control, or they had abandoned the study of 
learning altogether. As a result, the special issues involved in learning how to get something from the 
environment received relatively little attention. In retrospect, focusing on this idea was the critical 
step that set this branch of research in motion. Little progress could be made in the computational 
study of reinforcement learning until it was recognized that such a fundamental idea had not yet been 
thoroughly explored. 

The field has come a long way since then, evolving and maturing in several directions. Reinforcement 
learning has gradually become one of the most active research areas in machine learning, artificial 
intelligence, and neural network research. The field has developed strong mathematical foundations and 
impressive applications. The computational study of reinforcement learning is now a large field, with 
hundreds of active researchers around the world in diverse disciplines such as psychology, control theory, 
artificial intelligence, and neuroscience. Particularly important have been the contributions establishing 
and developing the relationships to the theory of optimal control and dynamic programming. The 
overall problem of learning from interaction to achieve goals is still far from being solved, but our 
understanding of it has improved significantly. We can now place component ideas, such as temporal- 
difference learning, dynamic programming, and function approximation, within a coherent perspective 
with respect to the overall problem. 

Our goal in writing this book was to provide a clear and simple account of the key ideas and algorithms 
of reinforcement learning. We wanted our treatment to be accessible to readers in all of the related 
disciplines, but we could not cover all of these perspectives in detail. For the most part, our treatment 
takes the point of view of artificial intelligence and engineering. Coverage of connections to other fields 
we leave to others or to another time. We also chose not to produce a rigorous formal treatment of
reinforcement learning. We did not reach for the highest possible level of mathematical abstraction
and did not rely on a theorem-proof format. We tried to choose a level of mathematical detail that 
points the mathematically inclined in the right directions without distracting from the simplicity and 
potential generality of the underlying ideas. 

[Three paragraphs elided in favor of updated content in the second edition.] 

In some sense we have been working toward this book for thirty years, and we have lots of people 
to thank. First, we thank those who have personally helped us develop the overall view presented 
in this book: Harry Klopf, for helping us recognize that reinforcement learning needed to be revived; 
Chris Watkins, Dimitri Bertsekas, John Tsitsiklis, and Paul Werbos, for helping us see the value of the 
relationships to dynamic programming; John Moore and Jim Kehoe, for insights and inspirations from 
animal learning theory; Oliver Selfridge, for emphasizing the breadth and importance of adaptation; 
and, more generally, our colleagues and students who have contributed in countless ways: Ron Williams, 
Charles Anderson, Satinder Singh, Sridhar Mahadevan, Steve Bradtke, Bob Crites, Peter Dayan, and 
Leemon Baird. Our view of reinforcement learning has been significantly enriched by discussions with 
Paul Cohen, Paul Utgoff, Martha Steenstrup, Gerry Tesauro, Mike Jordan, Leslie Kaelbling, Andrew 
Moore, Chris Atkeson, Tom Mitchell, Nils Nilsson, Stuart Russell, Tom Dietterich, Tom Dean, and Bob 
Narendra. We thank Michael Littman, Gerry Tesauro, Bob Crites, Satinder Singh, and Wei Zhang 
for providing specifics of Sections 4.7, 15.1, 15.4, 15.5, and 15.6 respectively. We thank the Air Force 
Office of Scientific Research, the National Science Foundation, and GTE Laboratories for their long and 
farsighted support. 

We also wish to thank the many people who have read drafts of this book and provided valuable 
comments, including Tom Kalt, John Tsitsiklis, Pawel Cichosz, Olle Gallmo, Chuck Anderson, Stuart
Russell, Ben Van Roy, Paul Steenstrup, Paul Cohen, Sridhar Mahadevan, Jette Randlov, Brian
Sheppard, Thomas O’Connell, Richard Coggins, Cristina Versino, John H. Hiett, Andreas Badelt, Jay
Ponte, Joe Beck, Justus Piater, Martha Steenstrup, Satinder Singh, Tommi Jaakkola, Dimitri Bertsekas,
Torbjörn Ekman, Christina Björkman, Jakob Carlström, and Olle Palmgren. Finally, we thank
Gwyn Mitchell for helping in many ways, and Harry Stanton and Bob Prior for being our champions 
at MIT Press.


Preface to the Second Edition


The twenty years since the publication of the first edition of this book have seen tremendous progress
in artificial intelligence, propelled in large part by advances in machine learning, including advances 
in reinforcement learning. Although the impressive computational power that became available is 
responsible for some of these advances, new developments in theory and algorithms have been driving 
forces as well. In the face of this progress, a second edition of our 1998 book was long overdue, and 
we finally began the project in 2013. Our goal for the second edition was the same as our goal for the 
first: to provide a clear and simple account of the key ideas and algorithms of reinforcement learning 
that is accessible to readers in all the related disciplines. The edition remains an introduction, and we 
retain a focus on core, on-line learning algorithms. This edition includes some new topics that rose to 
importance over the intervening years, and we expanded coverage of topics that we now understand 
better. But we made no attempt to provide comprehensive coverage of the field, which has exploded in 
many different directions with outstanding contributions by many active researchers. We apologize for 
having to leave out all but a handful of these contributions. 

As in the first edition, we chose not to produce a rigorous formal treatment of reinforcement learning, 
or to formulate it in the most general terms. However, since the first edition, our deeper understanding 
of some topics required a bit more mathematics to explain; we have set off the more mathematical 
parts in shaded boxes that the non-mathematically-inclined may choose to skip. We also use a slightly 
different notation than was used in the first edition. In teaching, we have found that the new notation 
helps to address some common points of confusion. It emphasizes the difference between random 
variables, denoted with capital letters, and their instantiations, denoted in lower case. For example, 
the state, action, and reward at time step $t$ are denoted $S_t$, $A_t$, and $R_t$, while their possible values
might be denoted $s$, $a$, and $r$. Along with this, it is natural to use lower case for value functions (e.g.,
$v_\pi$) and restrict capitals to their tabular estimates (e.g., $Q_t(s, a)$). Approximate value functions are
deterministic functions of random parameters and are thus also in lower case (e.g., $\hat{v}(s, \mathbf{w}_t) \approx v_\pi(s)$).
Vectors, such as the weight vector $\mathbf{w}_t$ (formerly $\boldsymbol{\theta}_t$) and the feature vector $\mathbf{x}_t$ (formerly $\boldsymbol{\phi}_t$), are bold
and written in lowercase even if they are random variables. Uppercase bold is reserved for matrices. In
the first edition we used special notations, $\mathcal{P}^a_{ss'}$ and $\mathcal{R}^a_{ss'}$, for the transition probabilities and expected
rewards. One weakness of that notation is that it still did not fully characterize the dynamics of the 
rewards, giving only their expectations. Another weakness is the excess of subscripts and superscripts. 
In this edition we use the explicit notation of $p(s', r \mid s, a)$ for the joint probability for the next state and
reward given the current state and action. All the changes in notation are summarized in a table on 
page xv. 
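
As a brief illustration of why the four-argument function is more informative, the older quantities can be recovered from it by summing, but not the other way around. The identities below are standard consequences of treating $p(s', r \mid s, a)$ as a joint distribution; they are a sketch added for clarity, not text from the preface, and they assume finite state and reward sets $\mathcal{S}$ and $\mathcal{R}$.

```latex
% p is a joint distribution over (next state, reward) for each (s, a):
\sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1

% First-edition transition probabilities: marginalize out the reward.
\mathcal{P}^a_{ss'} = p(s' \mid s, a) = \sum_{r \in \mathcal{R}} p(s', r \mid s, a)

% First-edition expected rewards: a conditional expectation, defined when p(s' | s, a) > 0.
\mathcal{R}^a_{ss'} = r(s, a, s') = \sum_{r \in \mathcal{R}} r \, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}
```

Two different reward distributions with the same mean give the same $\mathcal{R}^a_{ss'}$ but different $p(s', r \mid s, a)$, which is the sense in which the older notation gave only expectations.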

The second edition is significantly expanded, and its top-level organization has been revamped. 
After the introductory first chapter, the second edition is divided into three new parts. The first part 
(Chapters 2-8) treats as much of reinforcement learning as possible without going beyond the tabular 
case for which exact solutions can be found. We cover both learning and planning methods for the 
tabular case, as well as their unification in n-step methods and in Dyna. Many algorithms presented in 
this part are new to the second edition, including UCB, Expected Sarsa, Double learning, tree-backup, 
$Q(\sigma)$, RTDP, and MCTS. Doing the tabular case first, and thoroughly, enables core ideas to be developed
in the simplest possible setting. The whole second part of the book (Chapters 9-13) is then devoted to
extending the ideas to function approximation. It has new sections on artificial neural networks, the 
Fourier basis, LSTD, kernel-based methods, Gradient-TD and Emphatic-TD methods, average-reward
methods, true online TD($\lambda$), and policy-gradient methods. The second edition significantly expands
the treatment of off-policy learning, first for the tabular case in Chapters 5-7, then with function 
approximation in Chapters 11 and 12. Another change is that the second edition separates the forward- 
view idea of n-step bootstrapping (now treated more fully in Chapter 7) from the backward-view idea 
of eligibility traces (now treated independently in Chapter 12). The third part of the book has large 
new chapters on reinforcement learning’s relationships to psychology (Chapter 14) and neuroscience 
(Chapter 15), as well as an updated case-studies chapter including Atari game playing, Watson, and 
AlphaGo (Chapter 16). Still, out of necessity we have included only a small subset of all that has been 
done in the field. Our choices reflect our long-standing interests in inexpensive model-free methods 
that should scale well to large applications. The final chapter now includes a discussion of the future 
societal impacts of reinforcement learning. For better or worse, the second edition is about 60% longer 
than the first. 

This book is designed to be used as the primary text for a one- or two-semester course on reinforcement
learning. For a one-semester course, the first ten chapters should be covered in order and
form a good core, to which can be added material from the other chapters, from other books such
as Bertsekas and Tsitsiklis (1996), Wiering and van Otterlo (2012), and Szepesvári (2010), or from
the literature, according to taste. Depending on the students’ background, some additional material
on online supervised learning may be helpful. The ideas of options and option models are a natural 
addition (Sutton, Precup and Singh, 1999). A two-semester course can cover all the chapters as well 
as supplementary material. The book can also be used as part of broader courses on machine learning, 
artificial intelligence, or neural networks. In this case, it may be desirable to cover only a subset of 
the material. We recommend covering Chapter 1 for a brief overview, Chapter 2 through Section 2.4, 
Chapter 3, and then selecting sections from the remaining chapters according to time and interests. 
Chapter 6 is the most important for the subject and for the rest of the book. A course focusing on 
machine learning or neural networks should cover Chapters 9 and 10, and a course focusing on artificial 
intelligence or planning should cover Chapter 8. Throughout the book, sections and chapters that are 
more difficult and not essential to the rest of the book are marked with a *. These can be omitted on 
first reading without creating problems later on. Some exercises are also marked with a * to indicate 
that they are more advanced and not essential to understanding the basic material of the chapter. 

Most chapters end with a section entitled “Bibliographical and Historical Remarks,” wherein we credit 
the sources of the ideas presented in that chapter, provide pointers to further reading and ongoing 
research, and describe relevant historical background. Despite our attempts to make these sections 
authoritative and complete, we have undoubtedly left out some important prior work. For that we again 
apologize, and we welcome corrections and extensions for incorporation into the electronic version of 
the book. 

Like the first edition, this edition of the book is dedicated to the memory of A. Harry Klopf. It was 
Harry who introduced us to each other, and it was his ideas about the brain and artificial intelligence 
that launched our long excursion into reinforcement learning. Trained in neurophysiology and long 
interested in machine intelligence, Harry was a senior scientist affiliated with the Avionics Directorate 
of the Air Force Office of Scientific Research (AFOSR) at Wright-Patterson Air Force Base, Ohio. He was 
dissatisfied with the great importance attributed to equilibrium-seeking processes, including homeostasis 
and error-correcting pattern classification methods, in explaining natural intelligence and in providing 
a basis for machine intelligence. He noted that systems that try to maximize something (whatever that 
might be) are qualitatively different from equilibrium-seeking systems, and he argued that maximizing 
systems hold the key to understanding important aspects of natural intelligence and for building artificial 
intelligences. Harry was instrumental in obtaining funding from AFOSR for a project to assess the 
scientific merit of these and related ideas. This project was conducted in the late 1970s at the University
of Massachusetts Amherst (UMass Amherst), initially under the direction of Michael Arbib, William
Kilmer, and Nico Spinelli, professors in the Department of Computer and Information Science at UMass 
Amherst, and founding members of the Cybernetics Center for Systems Neuroscience at the University, 
a farsighted group focusing on the intersection of neuroscience and artificial intelligence. Barto, a 
recent Ph.D. from the University of Michigan, was hired as a postdoctoral researcher on the project.
Meanwhile, Sutton, an undergraduate studying computer science and psychology at Stanford, had 
been corresponding with Harry regarding their mutual interest in the role of stimulus timing in classical 
conditioning. Harry suggested to the UMass group that Sutton would be a great addition to the project. 
Thus, Sutton became a UMass graduate student, whose Ph.D. was directed by Barto, who had become 
an Associate Professor. The study of reinforcement learning as presented in this book is rightfully an 
outcome of that project instigated by Harry and inspired by his ideas. Further, Harry was responsible 
for bringing us, the authors, together in what has been a long and enjoyable interaction. By dedicating 
this book to Harry we honor his essential contributions, not only to the field of reinforcement learning, 
but also to our collaboration. We also thank Professors Arbib, Kilmer, and Spinelli for the opportunity 
they provided to us to begin exploring these ideas. Finally, we thank AFOSR for generous support over 
the early years of our research, and the NSF for its generous support over many of the following years. 

We have very many people to thank for their inspiration and help with this second edition. Everyone 
we acknowledged for their inspiration and help with the first edition deserves our deepest gratitude for
this edition as well, which would not exist were it not for their contributions to edition number one. 
To that long list we must add many others who contributed specifically to the second edition. Our 
students over the many years that we have taught this material contributed in countless ways: exposing 
errors, offering fixes, and, not least, being confused in places where we could have explained things
better. We thank many readers on the internet for finding errors and potential points of confusion in 
the second edition, and specifically Martha Steenstrup for reading and providing detailed comments 
throughout. The chapters on psychology and neuroscience could not have been written without the 
help of many experts in those fields. We thank John Moore for his patient tutoring over many, many
years on animal learning experiments, theory, and neuroscience, and for his careful reading of multiple
drafts of Chapters 14 and 15. We also thank Matt Botvinick, Nathaniel Daw, Peter Dayan, and Yael
Niv for their penetrating comments on drafts of these chapters, their essential guidance through the
massive literature, and their interception of many of our errors in early drafts. Of course, the remaining 
errors in these chapters—and there must still be some—are totally our own. We owe Phil Thomas 
thanks for helping us make these chapters accessible to non-psychologists and non-neuroscientists. We 
thank Jim Houk for introducing us to the subject of information processing in the basal ganglia. Jose 
Martinez, Terry Sejnowski, David Silver, Gerry Tesauro, Georgios Theocharous, and Phil Thomas 
generously helped us understand details of their reinforcement learning applications for inclusion in the 
case-studies chapter and commented on drafts of these sections. Special thanks are owed to David Silver 
for helping us better understand Monte Carlo Tree Search and the DeepMind Go-playing programs. 
We thank George Konidaris for his help with the section on the Fourier basis. Emilio Cartoni, Stefan 
Dernbach, Clemens Rosenbaum, and Patrick Taylor helped us in a number of important ways for which
we are most grateful. 

Sutton would also like to thank the members of the Reinforcement Learning and Artificial Intelligence 
laboratory at the University of Alberta for contributions to the second edition. He owes a particular 
debt to Rupam Mahmood for essential contributions to the treatment of off-policy Monte Carlo methods 
in Chapter 5, to Hamid Maei for helping develop the perspective on off-policy learning presented in 
Chapter 11, to Eric Graves for conducting the experiments in Chapter 13, to Shantong Zhang for 
replicating and thus verifying almost all the experimental results, to Kris De Asis for improving the 
new technical content of Chapter 12, to Harm van Seijen for insights that led to the separation of 
n-step methods from eligibility traces and, along with Hado van Hasselt, for the ideas involving exact 
equivalence of forward and backward views of eligibility traces presented in Chapter 12. Sutton would 
also like to gratefully acknowledge the support and freedom he was granted by the Government of
Alberta and the Natural Sciences and Engineering Research Council of Canada throughout the period
during which the second edition was conceived and written. In particular, he would like to thank Randy 
Goebel for creating a supportive and far-sighted environment for research in Alberta. 



Summary of Notation 


Capital letters are used for random variables, whereas lower case letters are used for the values of 
random variables and for scalar functions. Quantities that are required to be real-valued vectors are 
written in bold and in lower case (even if random variables). Matrices are bold capitals. 


$\doteq$                  equality relationship that is true by definition
$\approx$                 approximately equal
$\Pr\{X = x\}$            probability that a random variable $X$ takes on the value $x$
$X \sim p$                the random variable $X$ is selected with distribution $p(x) \doteq \Pr\{X = x\}$
$\mathbb{E}[X]$           expectation of a random variable $X$, i.e., $\mathbb{E}[X] \doteq \sum_x p(x)\,x$
$\arg\max_a f(a)$         a value of $a$ at which $f(a)$ takes its maximal value
$\ln x$                   natural logarithm of $x$
$\exp(x)$                 $e^x$, where $e \approx 2.71828$ is the base of the natural logarithm
$\mathbb{R}$              set of real numbers


$\varepsilon$                       probability of taking a random action in an $\varepsilon$-greedy policy
$\alpha, \beta$                     step-size parameters
$\gamma$                            discount-rate parameter
$\lambda$                           decay-rate parameter for eligibility traces
$\mathbb{1}_{\text{predicate}}$     indicator function (1 if the predicate is true, else 0)


In a multi-arm bandit problem:

$k$             number of actions (arms)
$t$             discrete time step or play number
$q_*(a)$        true value (expected reward) of action $a$
$Q_t(a)$        estimate at time $t$ of $q_*(a)$
$N_t(a)$        number of times action $a$ has been selected prior to time $t$
$H_t(a)$        learned preference for selecting action $a$
$\pi_t(a)$      probability of selecting action $a$ at time $t$
$\bar{R}_t$     estimate at time $t$ of the expected reward given $\pi_t$


In a Markov Decision Process:

$s, s'$             states
$a$                 an action
$r$                 a reward
$\mathcal{S}$       set of all nonterminal states
$\mathcal{S}^+$     set of all states, including the terminal state
$\mathcal{A}$       set of all actions
$\mathcal{R}$       set of all possible rewards, a finite subset of $\mathbb{R}$
$\subset$           subset of, e.g., $\mathcal{R} \subset \mathbb{R}$
$\in$               is an element of, e.g., $s \in \mathcal{S}$, $r \in \mathcal{R}$
$|\mathcal{S}|$     number of elements in set $\mathcal{S}$


$t$                                   discrete time step
$T, T(t)$                             final time step of an episode, or of the episode including time step $t$
$A_t$                                 action at time $t$
$S_t$                                 state at time $t$, typically due, stochastically, to $S_{t-1}$ and $A_{t-1}$
$R_t$                                 reward at time $t$, typically due, stochastically, to $S_{t-1}$ and $A_{t-1}$
$\pi$                                 policy, decision-making rule
$\pi(s)$                              action taken in state $s$ under deterministic policy $\pi$
$\pi(a \mid s)$                       probability of taking action $a$ in state $s$ under stochastic policy $\pi$
$\pi(a \mid s, \boldsymbol{\theta})$  probability of taking action $a$ in state $s$ given parameter $\boldsymbol{\theta}$


$G_t$                      return (cumulative discounted reward) following time $t$ (Section 3.3)
$\bar{G}_{t:h}$            flat return (uncorrected, undiscounted) from $t+1$ to $h$ (Section 5.8)
$G_t^{\lambda s}$          $\lambda$-return, corrected by estimated state values (Section 12.1)
$G_t^{\lambda a}$          $\lambda$-return, corrected by estimated action values (Section 12.1)
$G_{t:h}^{\lambda s}$      truncated, corrected $\lambda$-return, with state values (Section 12.3)
$G_{t:h}^{\lambda a}$      truncated, corrected $\lambda$-return, with action values (Section 12.3)


$p(s', r \mid s, a)$     probability of transition to state $s'$ with reward $r$, from state $s$ and action $a$
$p(s' \mid s, a)$        probability of transition to state $s'$, from state $s$ taking action $a$
$r(s, a, s')$            expected immediate reward on transition from $s$ to $s'$ under action $a$


$v_\pi(s)$                    value of state $s$ under policy $\pi$ (expected return)
$v_*(s)$                      value of state $s$ under the optimal policy
$q_\pi(s, a)$                 value of taking action $a$ in state $s$ under policy $\pi$
$q_*(s, a)$                   value of taking action $a$ in state $s$ under the optimal policy
$V, V_t$                      array estimates of state-value function $v_\pi$ or $v_*$
$Q, Q_t$                      array estimates of action-value function $q_\pi$ or $q_*$

$\delta_t$                    temporal-difference error at $t$ (a random variable) (Section 6.1)
$\mathbf{w}, \mathbf{w}_t$    $d$-vector of weights underlying an approximate value function
$w_i$                         $i$th component of learnable weight vector
$d$                           dimensionality: the number of components of $\mathbf{w}$
$d'$                          alternate dimensionality: the number of components of $\boldsymbol{\theta}$
$m$                           number of 1s in a sparse binary feature vector
$\hat{v}(s, \mathbf{w})$      approximate value of state $s$ given weight vector $\mathbf{w}$
$v_{\mathbf{w}}(s)$           alternate notation for $\hat{v}(s, \mathbf{w})$
$\hat{q}(s, a, \mathbf{w})$   approximate value of state-action pair $s, a$ given weight vector $\mathbf{w}$
$\mathbf{x}(s)$               vector of features visible when in state $s$
$\mathbf{x}(s, a)$            vector of features visible when in state $s$ taking action $a$
$x_i(s), x_i(s, a)$           $i$th component of vector $\mathbf{x}(s)$ or $\mathbf{x}(s, a)$
$\mathbf{x}_t$                shorthand for $\mathbf{x}(S_t)$ or $\mathbf{x}(S_t, A_t)$
$\mathbf{w}^\top \mathbf{x}$  inner product of vectors, $\mathbf{w}^\top \mathbf{x} \doteq \sum_i w_i x_i$, e.g., $\hat{v}(s, \mathbf{w}) \doteq \mathbf{w}^\top \mathbf{x}(s)$
$\mu(s)$                      on-policy distribution over states (Section 9.2)
$\boldsymbol{\mu}$            $|\mathcal{S}|$-vector of the $\mu(s)$
$\|x\|_\mu^2$                 $\mu$-weighted norm of any vector $x(s)$, i.e., $\sum_s \mu(s)\,x(s)^2$ (Section 11.4)


$\mathbf{v}, \mathbf{v}_t$    secondary $d$-vector of weights, used to learn $\mathbf{w}$ (Chapter 11)

$\mathbf{z}_t$                $d$-vector of eligibility traces at time $t$ (Chapter 12)

$\boldsymbol{\theta}, \boldsymbol{\theta}_t$    parameter vector of target policy (Chapter 13)
$\pi_{\boldsymbol{\theta}}$                     policy corresponding to parameter $\boldsymbol{\theta}$
$J(\pi), J(\boldsymbol{\theta})$                performance measure for policy $\pi$ or $\pi_{\boldsymbol{\theta}}$
$h(s, a, \boldsymbol{\theta})$                  a preference for selecting action $a$ in state $s$ based on $\boldsymbol{\theta}$


$b$               behavior policy selecting actions while learning about target policy $\pi$,
                  or a baseline function $b : \mathcal{S} \to \mathbb{R}$ for policy-gradient methods,
                  or a branching factor
$\rho_{t:h}$      importance sampling ratio for time $t$ to time $h$ (Section 5.5)
$\rho_t$          importance sampling ratio for time $t$ alone, $\rho_t \doteq \rho_{t:t}$
$r(\pi)$          average reward (reward rate) for policy $\pi$ (Section 10.3)
$\bar{R}_t$       estimate of $r(\pi)$ at time $t$


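$\mathbf{A}$                $d \times d$ matrix $\mathbf{A} \doteq \mathbb{E}\big[\mathbf{x}_t (\mathbf{x}_t - \gamma \mathbf{x}_{t+1})^\top\big]$ (Section 11.4)
$\mathbf{b}$                $d$-dimensional vector $\mathbf{b} \doteq \mathbb{E}[R_{t+1} \mathbf{x}_t]$
$\mathbf{w}_{\text{TD}}$    TD fixed point, $\mathbf{w}_{\text{TD}} \doteq \mathbf{A}^{-1} \mathbf{b}$ ($d$-vector)
$\mathbf{I}$                identity matrix
$\mathbf{P}$                $|\mathcal{S}| \times |\mathcal{S}|$ matrix of state-transition probabilities under $\pi$
$\mathbf{D}$                $|\mathcal{S}| \times |\mathcal{S}|$ diagonal matrix with $\mu(s)$ on its diagonal
$\mathbf{X}$                $|\mathcal{S}| \times d$ matrix with $\mathbf{x}(s)$ as its rows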
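
To make one entry of the notation concrete, the return $G_t$ (cumulative discounted reward, with discount rate $\gamma$) can be sketched in a few lines of Python. This is only an illustration under the usual definition $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$, which the summary names but does not spell out; the function name `discounted_return` and the example reward list are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Return G_t given rewards = [R_{t+1}, R_{t+2}, ..., R_T] and discount rate gamma.

    Works backward through the list using G_t = R_{t+1} + gamma * G_{t+1},
    with the return after the final step taken to be zero.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


if __name__ == "__main__":
    # Three rewards of 1 with gamma = 0.5: 1 + 0.5 * 1 + 0.25 * 1 = 1.75
    print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

With `gamma = 1.0` the function reduces to a plain sum of the rewards, which corresponds to the undiscounted case.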
Chapter 1


Introduction 


The idea that we learn by interacting with our environment is probably the first to occur to us when 
we think about the nature of learning. When an infant plays, waves its arms, or looks about, it has no 
explicit teacher, but it does have a direct sensorimotor connection to its environment. Exercising this 
connection produces a wealth of information about cause and effect, about the consequences of actions, 
and about what to do in order to achieve goals. Throughout our lives, such interactions are undoubtedly 
a major source of knowledge about our environment and ourselves. Whether we are learning to drive a 
car or to hold a conversation, we are acutely aware of how our environment responds to what we do, and 
we seek to influence what happens through our behavior. Learning from interaction is a foundational 
idea underlying nearly all theories of learning and intelligence. 

In this book we explore a computational approach to learning from interaction. Rather than directly 
theorizing about how people or animals learn, we explore idealized learning situations and evaluate the 
effectiveness of various learning methods. That is, we adopt the perspective of an artificial intelligence 
researcher or engineer. We explore designs for machines that are effective in solving learning problems of 
scientific or economic interest, evaluating the designs through mathematical analysis or computational 
experiments. The approach we explore, called reinforcement learning, is much more focused on goal-directed
learning from interaction than are other approaches to machine learning.


1.1 Reinforcement Learning 

Reinforcement learning is learning what to do—how to map situations to actions—so as to maximize 
a numerical reward signal. The learner is not told which actions to take, but instead must discover 
which actions yield the most reward by trying them. In the most interesting and challenging cases, 
actions may affect not only the immediate reward but also the next situation and, through that, all 
subsequent rewards. These two characteristics, trial-and-error search and delayed reward, are the two
most important distinguishing features of reinforcement learning. 

Reinforcement learning, like many topics whose names end with “ing,” such as machine learning 
and mountaineering, is simultaneously a problem, a class of solution methods that work well on the 
problem, and the field that studies this problem and its solution methods. It is convenient to use a
single name for all three things, but at the same time essential to keep the three conceptually separate. 
In particular, the distinction between problems and solution methods is very important in reinforcement 
learning; failing to make this distinction is the source of many confusions.

We formalize the problem of reinforcement learning using ideas from dynamical systems theory, 
specifically, as the optimal control of incompletely-known Markov decision processes. The details of this
formalization must wait until Chapter 3, but the basic idea is simply to capture the most important
aspects of the real problem facing a learning agent interacting over time with its environment to achieve
a goal. A learning agent must be able to sense the state of its environment to some extent and must be
able to take actions that affect the state. The agent also must have a goal or goals relating to the state of
the environment. Markov decision processes are intended to include just these three aspects (sensation,
action, and goal) in their simplest possible forms without trivializing any of them. Any method that
is well suited to solving such problems we consider to be a reinforcement learning method. 
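
The three aspects named above (sensation, action, and goal) can be sketched as a minimal interaction loop. The example below is an illustration only, not the formalization of Chapter 3: the `CorridorEnv` class, its `reset`/`step` interface, and the two policies are hypothetical names chosen for this sketch, and the environment is a made-up five-state corridor whose goal is expressed purely through the reward.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: states 0..length-1, goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); walls clip the move.
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0  # the goal is encoded only in the reward
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode; return (total reward, number of steps taken)."""
    state = env.reset()                          # sensation: observe the state
    total = 0.0
    for t in range(max_steps):
        action = policy(state)                   # action: chosen from the state
        state, reward, done = env.step(action)   # environment responds
        total += reward                          # goal: accumulate reward
        if done:
            return total, t + 1
    return total, max_steps


def always_right(state):
    return +1


def random_policy(state):
    return random.choice([-1, +1])


if __name__ == "__main__":
    print(run_episode(CorridorEnv(), always_right))   # (1.0, 4)
    print(run_episode(CorridorEnv(), random_policy))  # varies from run to run
```

Nothing in the loop tells the agent which action is correct; it only senses states and receives rewards, which is the setting the paragraph describes.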

Reinforcement learning is different from supervised learning, the kind of learning studied in most
current research in the field of machine learning. Supervised learning is learning from a training set
of labeled examples provided by a knowledgeable external supervisor. Each example is a description of
a situation together with a specification (the label) of the correct action the system should take in
that situation, which is often to identify a category to which the situation belongs. The object of this
kind of learning is for the system to extrapolate, or generalize, its responses so that it acts correctly 
in situations not present in the training set. This is an important kind of learning, but alone it is 
not adequate for learning from interaction. In interactive problems it is often impractical to obtain 
examples of desired behavior that are both correct and representative of all the situations in which the 
agent has to act. In uncharted territory—where one would expect learning to be most beneficial—an 
agent must be able to learn from its own experience. 

Reinforcement learning is also different from what machine learning researchers call unsupervised 
learning, which is typically about finding structure hidden in collections of unlabeled data. The terms 
supervised learning and unsupervised learning would seem to exhaustively classify machine learning 
paradigms, but they do not. Although one might be tempted to think of reinforcement learning as a 
kind of unsupervised learning because it does not rely on examples of correct behavior, reinforcement 
learning is trying to maximize a reward signal instead of trying to find hidden structure. Uncovering 
structure in an agent’s experience can certainly be useful in reinforcement learning, but by itself does 
not address the reinforcement learning problem of maximizing a reward signal. We therefore consider 
reinforcement learning to be a third machine learning paradigm, alongside supervised learning and 
unsupervised learning and perhaps other paradigms as well. 

One of the challenges that arise in reinforcement learning, and not in other kinds of learning, is the 
trade-off between exploration and exploitation. To obtain a lot of reward, a reinforcement learning 
agent must prefer actions that it has tried in the past and found to be effective in producing reward. 
But to discover such actions, it has to try actions that it has not selected before. The agent has to 
exploit what it has already experienced in order to obtain reward, but it also has to explore in order to 
make better action selections in the future. The dilemma is that neither exploration nor exploitation 
can be pursued exclusively without failing at the task. The agent must try a variety of actions and 
progressively favor those that appear to be best. On a stochastic task, each action must be tried many 
times to gain a reliable estimate of its expected reward. The exploration-exploitation dilemma has been 
intensively studied by mathematicians for many decades, yet remains unresolved. For now, we simply 
note that the entire issue of balancing exploration and exploitation does not even arise in supervised 
and unsupervised learning, at least in their purest forms. 
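
As a rough illustration of the trade-off, the sketch below contrasts greedy choices with occasional exploratory ones. It is only a sketch under stated assumptions: the task (a few actions with fixed mean rewards plus Gaussian noise), the names `run_agent` and `true_means`, and the exploration probability `epsilon` are hypothetical choices made for this example and are not defined in the text so far.

```python
import random

def run_agent(true_means, epsilon, steps, seed=0):
    """Estimate each action's expected reward while choosing actions.

    true_means, epsilon, steps, and seed are illustrative assumptions,
    not quantities taken from the text.
    """
    rng = random.Random(seed)
    n_actions = len(true_means)
    estimates = [0.0] * n_actions   # current estimate of each action's expected reward
    counts = [0] * n_actions        # number of times each action has been tried
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try an action regardless of its current estimate.
            action = rng.randrange(n_actions)
        else:
            # Exploit: pick the action that currently looks best.
            action = max(range(n_actions), key=lambda a: estimates[a])
        reward = rng.gauss(true_means[action], 1.0)   # stochastic reward
        counts[action] += 1
        # Incremental sample average of the rewards seen for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward
    return estimates, counts, total_reward

estimates, counts, total_reward = run_agent([0.0, 0.5, 2.0], epsilon=0.1, steps=5000)
```

With `epsilon=0` the agent can lock onto whichever action happened to look good first. A small positive `epsilon` keeps sampling the alternatives, so the estimates for all actions keep improving at the cost of some reward on the exploratory steps.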

Another key feature of reinforcement learning is that it explicitly considers the whole problem of a 
goal-directed agent interacting with an uncertain environment. This is in contrast to many approaches 
that consider subproblems without addressing how they might fit into a larger picture. For example, we 
have mentioned that much of machine learning research is concerned with supervised learning without 
explicitly specifying how such an ability would finally be useful. Other researchers have developed 
theories of planning with general goals, but without considering planning’s role in real-time decision 
making, or the question of where the predictive models necessary for planning would come from. 
Although these approaches have yielded many useful results, their focus on isolated subproblems is a 
significant limitation. 

Reinforcement learning takes the opposite tack, starting with a complete, interactive, goal-seeking 
agent. All reinforcement learning agents have explicit goals, can sense aspects of their environments, 
and can choose actions to influence their environments. Moreover, it is usually assumed from the 
beginning that the agent has to operate despite significant uncertainty about the environment it faces. 
When reinforcement learning involves planning, it has to address the interplay between planning and 
real-time action selection, as well as the question of how environment models are acquired and improved. 
When reinforcement learning involves supervised learning, it does so for specific reasons that determine 
which capabilities are critical and which are not. For learning research to make progress, important 
subproblems have to be isolated and studied, but they should be subproblems that play clear roles in 
complete, interactive, goal-seeking agents, even if all the details of the complete agent cannot yet be 
filled in. 

By a complete, interactive, goal-seeking agent we do not always mean something like a complete 
organism or robot. These are clearly examples, but a complete, interactive, goal-seeking agent can also 
be a component of a larger behaving system. In this case, the agent directly interacts with the rest of 
the larger system and indirectly interacts with the larger system’s environment. A simple example is 
an agent that monitors the charge level of a robot’s battery and sends commands to the robot’s control 
architecture. This agent’s environment is the rest of the robot together with the robot’s environment. 
One must look beyond the most obvious examples of agents and their environments to appreciate the 
generality of the reinforcement learning framework. 

One of the most exciting aspects of modern reinforcement learning is its substantive and fruitful 
interactions with other engineering and scientific disciplines. Reinforcement learning is part of a 
decades-long trend within artificial intelligence and machine learning toward greater integration with statistics, 
optimization, and other mathematical subjects. For example, the ability of some reinforcement learning 
methods to learn with parameterized approximators addresses the classical “curse of dimensionality” in 
operations research and control theory. More distinctively, reinforcement learning has also interacted 
strongly with psychology and neuroscience, with substantial benefits going both ways. Of all the forms 
of machine learning, reinforcement learning is the closest to the kind of learning that humans and 
other animals do, and many of the core algorithms of reinforcement learning were originally inspired by 
biological learning systems. Reinforcement learning has also given back, both through a psychological 
model of animal learning that better matches some of the empirical data, and through an influential 
model of parts of the brain’s reward system. The body of this book develops the ideas of reinforcement 
learning that pertain to engineering and artificial intelligence, with connections to psychology and 
neuroscience summarized in Chapters 14 and 15. 

Finally, reinforcement learning is also part of a larger trend in artificial intelligence back toward 
simple general principles. Since the late 1960s, many artificial intelligence researchers presumed that 
there are no general principles to be discovered, that intelligence is instead due to the possession of 
a vast number of special purpose tricks, procedures, and heuristics. It was sometimes said that if we 
could just get enough relevant facts into a machine, say one million, or one billion, then it would become 
intelligent. Methods based on general principles, such as search or learning, were characterized as “weak 
methods,” whereas those based on specific knowledge were called “strong methods.” This view is still 
common today, but not dominant. From our point of view, it was simply premature: too little effort 
had been put into the search for general principles to conclude that there were none. Modern artificial 
intelligence now includes much research looking for general principles of learning, search, and decision 
making, as well as trying to incorporate vast amounts of domain knowledge. It is not clear how far 
back the pendulum will swing, but reinforcement learning research is certainly part of the swing back 
toward simpler and fewer general principles of artificial intelligence. 

1.2 Examples 

A good way to understand reinforcement learning is to consider some of the examples and possible 
applications that have guided its development. 

• A master chess player makes a move. The choice is informed both by planning—anticipating 
possible replies and counterreplies—and by immediate, intuitive judgments of the desirability of 
particular positions and moves. 

• An adaptive controller adjusts parameters of a petroleum refinery’s operation in real time. The 
controller optimizes the yield/cost/quality trade-off on the basis of specified marginal costs without 
sticking strictly to the set points originally suggested by engineers. 

• A gazelle calf struggles to its feet minutes after being born. Half an hour later it is running at 20 
miles per hour. 

• A mobile robot decides whether it should enter a new room in search of more trash to collect or 
start trying to find its way back to its battery recharging station. It makes its decision based 
on the current charge level of its battery and how quickly and easily it has been able to find the 
recharger in the past. 

• Phil prepares his breakfast. Closely examined, even this apparently mundane activity reveals a 
complex web of conditional behavior and interlocking goal-subgoal relationships: walking to the 
cupboard, opening it, selecting a cereal box, then reaching for, grasping, and retrieving the box. 
Other complex, tuned, interactive sequences of behavior are required to obtain a bowl, spoon, 
and milk jug. Each step involves a series of eye movements to obtain information and to guide 
reaching and locomotion. Rapid judgments are continually made about how to carry the objects 
or whether it is better to ferry some of them to the dining table before obtaining others. Each 
step is guided by goals, such as grasping a spoon or getting to the refrigerator, and is in service 
of other goals, such as having the spoon to eat with once the cereal is prepared and ultimately 
obtaining nourishment. Whether he is aware of it or not, Phil is accessing information about the 
state of his body that determines his nutritional needs, level of hunger, and food preferences. 

These examples share features that are so basic that they are easy to overlook. All involve interaction 
between an active decision-making agent and its environment, within which the agent seeks to achieve 
a goal despite uncertainty about its environment. The agent’s actions are permitted to affect the future 
state of the environment (e.g., the next chess position, the level of reservoirs of the refinery, the robot’s 
next location and the future charge level of its battery), thereby affecting the options and opportunities 
available to the agent at later times. Correct choice requires taking into account indirect, delayed 
consequences of actions, and thus may require foresight or planning. 

At the same time, in all these examples the effects of actions cannot be fully predicted; thus the 
agent must monitor its environment frequently and react appropriately. For example, Phil must watch 
the milk he pours into his cereal bowl to keep it from overflowing. All these examples involve goals 
that are explicit in the sense that the agent can judge progress toward its goal based on what it can 
sense directly. The chess player knows whether or not he wins, the refinery controller knows how much 
petroleum is being produced, the mobile robot knows when its batteries run down, and Phil knows 
whether or not he is enjoying his breakfast. 

In all of these examples the agent can use its experience to improve its performance over time. The 
chess player refines the intuition he uses to evaluate positions, thereby improving his play; the gazelle 
calf improves the efficiency with which it can run; Phil learns to streamline making his breakfast. The 
knowledge the agent brings to the task at the start—either from previous experience with related tasks 
or built into it by design or evolution—influences what is useful or easy to learn, but interaction with 
the environment is essential for adjusting behavior to exploit specific features of the task. 

1.3 Elements of Reinforcement Learning 


Beyond the agent and the environment, one can identify four main subelements of a reinforcement 
learning system: a policy, a reward signal, a value function, and, optionally, a model of the environment. 

A policy defines the learning agent’s way of behaving at a given time. Roughly speaking, a policy 
is a mapping from perceived states of the environment to actions to be taken when in those states. 
It corresponds to what in psychology would be called a set of stimulus-response rules or associations. 
In some cases the policy may be a simple function or lookup table, whereas in others it may involve 
extensive computation such as a search process. The policy is the core of a reinforcement learning agent 
in the sense that it alone is sufficient to determine behavior. In general, policies may be stochastic. 

A reward signal defines the goal in a reinforcement learning problem. On each time step, the environment 
sends to the reinforcement learning agent a single number called the reward. The agent’s sole 
objective is to maximize the total reward it receives over the long run. The reward signal thus defines 
what are the good and bad events for the agent. In a biological system, we might think of rewards as 
analogous to the experiences of pleasure or pain. They are the immediate and defining features of the 
problem faced by the agent. The reward signal is the primary basis for altering the policy; if an action 
selected by the policy is followed by low reward, then the policy may be changed to select some other 
action in that situation in the future. In general, reward signals may be stochastic functions of the state 
of the environment and the actions taken. 

Whereas the reward signal indicates what is good in an immediate sense, a value function specifies 
what is good in the long run. Roughly speaking, the value of a state is the total amount of reward an 
agent can expect to accumulate over the future, starting from that state. Whereas rewards determine 
the immediate, intrinsic desirability of environmental states, values indicate the long-term desirability 
of states after taking into account the states that are likely to follow, and the rewards available in those 
states. For example, a state might always yield a low immediate reward but still have a high value 
because it is regularly followed by other states that yield high rewards. Or the reverse could be true. 
To make a human analogy, rewards are somewhat like pleasure (if high) and pain (if low), whereas 
values correspond to a more refined and farsighted judgment of how pleased or displeased we are that 
our environment is in a particular state. Expressed this way, we hope it is clear that value functions 
formalize a basic and familiar idea. 
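
The distinction between reward and value can be checked on a toy case. The sketch below assumes a hypothetical deterministic, undiscounted chain of states; the state names and reward numbers are invented for illustration only. State A yields no immediate reward but leads to a large reward later, whereas state D yields a small reward immediately and nothing afterward.

```python
# Hypothetical deterministic chain: each state maps to
# (reward received on leaving the state, next state).
# None marks the end of the episode. All numbers are illustrative assumptions.
chain = {
    "A": (0.0, "B"),
    "B": (0.0, "C"),
    "C": (10.0, None),
    "D": (1.0, None),
}

def value(state):
    """Total reward accumulated from `state` until the episode ends."""
    total = 0.0
    while state is not None:
        reward, state = chain[state]
        total += reward
    return total

values = {s: value(s) for s in chain}
# values["A"] is 10.0 although A's immediate reward is 0.0;
# values["D"] is 1.0 although D's immediate reward is higher than A's.
```

An agent choosing between A and D on immediate reward alone would prefer D. Choosing on value, it would prefer A.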

Rewards are in a sense primary, whereas values, as predictions of rewards, are secondary. Without 
rewards there could be no values, and the only purpose of estimating values is to achieve more reward. 
Nevertheless, it is values with which we are most concerned when making and evaluating decisions. 
Action choices are made based on value judgments. We seek actions that bring about states of highest 
value, not highest reward, because these actions obtain the greatest amount of reward for us over the 
long run. Unfortunately, it is much harder to determine values than it is to determine rewards. Rewards 
are basically given directly by the environment, but values must be estimated and re-estimated from the 
sequences of observations an agent makes over its entire lifetime. In fact, the most important component 
of almost all reinforcement learning algorithms we consider is a method for efficiently estimating values. 
The central role of value estimation is arguably the most important thing we have learned about 
reinforcement learning over the last few decades. 

The fourth and final element of some reinforcement learning systems is a model of the environment. 
This is something that mimics the behavior of the environment, or more generally, that allows inferences 
to be made about how the environment will behave. For example, given a state and action, the model 
might predict the resultant next state and next reward. Models are used for planning, by which we 
mean any way of deciding on a course of action by considering possible future situations before they are 
actually experienced. Methods for solving reinforcement learning problems that use models and planning 
are called model-based methods, as opposed to simpler model-free methods that are explicitly 
trial-and-error learners, viewed as almost the opposite of planning. In Chapter 8 we explore reinforcement 
learning systems that simultaneously learn by trial and error, learn a model of the environment, and 
use the model for planning. Modern reinforcement learning spans the spectrum from low-level, 
trial-and-error learning to high-level, deliberative planning. 
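
To make the idea of a model concrete, here is a minimal sketch of a tabular model used for one-step planning. Everything in it is an assumption made for illustration: the two-level battery states, the `search` and `recharge` actions, the predicted rewards, and the value estimates are hypothetical and are not taken from any later chapter.

```python
# Hypothetical model: (state, action) -> (predicted next state, predicted reward).
model = {
    ("low", "search"): ("empty", -3.0),
    ("low", "recharge"): ("high", 0.0),
    ("high", "search"): ("low", 1.0),
    ("high", "recharge"): ("high", 0.0),
}

# Assumed current value estimates for each state (illustrative numbers).
values = {"high": 5.0, "low": 4.5, "empty": 0.0}

def plan_one_step(state, actions):
    """Choose an action by consulting the model before acting."""
    def lookahead(action):
        next_state, reward = model[(state, action)]
        # Predicted reward plus the estimated value of the predicted next state.
        return reward + values[next_state]
    return max(actions, key=lookahead)

choice_when_low = plan_one_step("low", ["search", "recharge"])    # "recharge"
choice_when_high = plan_one_step("high", ["search", "recharge"])  # "search"
```

The agent never experiences the `empty` state here. It rejects `search` in the `low` state because the model predicts a bad outcome, which is the sense in which planning considers situations before they are experienced.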


1.4 Limitations and Scope 

From the preceding discussion, it should be clear that reinforcement learning relies heavily on the 
concept of state—as input to the policy and value function, and as both input to and output from the 
model. Informally, we can think of the state as a signal conveying to the agent some sense of “how the 
environment is” at a particular time. The formal definition of state as we use it here is given by the 
framework of Markov decision processes presented in Chapter 3. More generally, however, we encourage 
the reader to follow the informal meaning and think of the state as whatever information is available 
to the agent about its environment. In effect, we assume that the state signal is produced by some 
preprocessing system that is nominally part of the agent’s environment. We do not address the issues 
of constructing, changing, or learning the state signal in this book. We take this approach not because 
we consider state representation to be unimportant, but in order to focus fully on the decision-making 
issues. In other words, our main concern is not with designing the state signal, but with deciding what 
action to take as a function of whatever state signal is available. (We do touch briefly on state design 
and construction in the last chapter in Section 17.3.) 

Most of the reinforcement learning methods we consider in this book are structured around estimating 
value functions, but it is not strictly necessary to do this to solve reinforcement learning problems. 
For example, methods such as genetic algorithms, genetic programming, simulated annealing, and 
other optimization methods have been used to approach reinforcement learning problems without ever 
appealing to value functions. These methods evaluate the “lifetime” behavior of many non-learning 
agents, each using a different policy for interacting with its environment, and select those that are able 
to obtain the most reward. We call these evolutionary methods because their operation is analogous 
to the way biological evolution produces organisms with skilled behavior even when they do not learn 
during their individual lifetimes. If the space of policies is sufficiently small, or can be structured so 
that good policies are common or easy to find—or if a lot of time is available for the search—then 
evolutionary methods can be effective. In addition, evolutionary methods have advantages on problems 
in which the learning agent cannot sense the complete state of its environment. 

Our focus is on reinforcement learning methods that learn while interacting with the environment, 
which evolutionary methods do not do. Methods able to take advantage of the details of individual 
behavioral interactions can be much more efficient than evolutionary methods in many cases. 
Evolutionary methods ignore much of the useful structure of the reinforcement learning problem: they do 
not use the fact that the policy they are searching for is a function from states to actions; they do 
not notice which states an individual passes through during its lifetime, or which actions it selects. In 
some cases this information can be misleading (e.g., when states are misperceived), but more often it 
should enable more efficient search. Although evolution and learning share many features and naturally 
work together, we do not consider evolutionary methods by themselves to be especially well suited to 
reinforcement learning problems and, accordingly, we do not cover them in this book. 

However, we do include some methods that, like evolutionary methods, do not appeal to value 
functions. These methods search in spaces of policies defined by a collection of numerical parameters. 
They estimate the directions the parameters should be adjusted in order to most rapidly improve 
a policy’s performance. Unlike evolutionary methods, however, they produce these estimates while 
the agent is interacting with its environment and so can take advantage of the details of individual 
behavioral interactions. Methods like this have proven useful in many problems, and some of the 
simplest reinforcement learning methods fall into this category (see Chapter 13). In the end, however, 
the best methods of this type tend to include value functions in some form. 

1.5 An Extended Example: Tic-Tac-Toe 


To illustrate the general idea of reinforcement learning and contrast it with other approaches, we next 
consider a single example in more detail. 


Consider the familiar child’s game of tic-tac-toe. Two players take turns 
playing on a three-by-three board. One player plays Xs and the other 
Os until one player wins by placing three marks in a row, horizontally, 
vertically, or diagonally, as the X player has in the game shown to the 
right. If the board fills up with neither player getting three in a row, the 
game is a draw. Because a skilled player can play so as never to lose, let us 
assume that we are playing against an imperfect player, one whose play is 
sometimes incorrect and allows us to win. For the moment, in fact, let us 
consider draws and losses to be equally bad for us. How might we construct 
a player that will find the imperfections in its opponent’s play and learn to maximize its chances of 
winning? 



Although this is a simple problem, it cannot readily be solved in a satisfactory way through classical 
techniques. For example, the classical “minimax” solution from game theory is not correct here because 
it assumes a particular way of playing by the opponent. For example, a minimax player would never 
reach a game state from which it could lose, even if in fact it always won from that state because of 
incorrect play by the opponent. Classical optimization methods for sequential decision problems, such 
as dynamic programming, can compute an optimal solution for any opponent, but require as input a 
complete specification of that opponent, including the probabilities with which the opponent makes each 
move in each board state. Let us assume that this information is not available a priori for this problem, 
as it is not for the vast majority of problems of practical interest. On the other hand, such information 
can be estimated from experience, in this case by playing many games against the opponent. About 
the best one can do on this problem is first to learn a model of the opponent’s behavior, up to some 
level of confidence, and then apply dynamic programming to compute an optimal solution given the 
approximate opponent model. In the end, this is not that different from some of the reinforcement 
learning methods we examine later in this book. 


An evolutionary method applied to this problem would directly search the space of possible policies 
for one with a high probability of winning against the opponent. Here, a policy is a rule that tells 
the player what move to make for every state of the game—every possible configuration of Xs and 
Os on the three-by-three board. For each policy considered, an estimate of its winning probability 
would be obtained by playing some number of games against the opponent. This evaluation would then 
direct which policy or policies were considered next. A typical evolutionary method would hill-climb 
in policy space, successively generating and evaluating policies in an attempt to obtain incremental 
improvements. Or, perhaps, a genetic-style algorithm could be used that would maintain and evaluate 
a population of policies. Literally hundreds of different optimization methods could be applied. 
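
The hill-climbing variant can be sketched as follows. To keep the example short, it does not implement tic-tac-toe: the game is replaced by a made-up task with five states and three actions, and the names `play_game`, `estimate_win_rate`, `hill_climb`, and `GOOD_ACTION` are hypothetical. What it is meant to show is the shape of the search. Each policy is held fixed, scored only by how often it wins, and replaced only when a mutated copy scores higher.

```python
import random

N_STATES, N_ACTIONS = 5, 3
# Assumed "good" action per state; used only inside the simulated games
# and never shown to the search procedure.
GOOD_ACTION = [2, 0, 1, 1, 0]

def play_game(policy, rng):
    """Simulate one game of the made-up task; return True on a win."""
    state = rng.randrange(N_STATES)
    p_win = 0.8 if policy[state] == GOOD_ACTION[state] else 0.2
    return rng.random() < p_win

def estimate_win_rate(policy, games, rng):
    """Evaluate a fixed policy by the fraction of games it wins."""
    return sum(play_game(policy, rng) for _ in range(games)) / games

def hill_climb(iterations, games, seed=0):
    rng = random.Random(seed)
    policy = [rng.randrange(N_ACTIONS) for _ in range(N_STATES)]
    score = estimate_win_rate(policy, games, rng)
    history = [score]
    for _ in range(iterations):
        candidate = list(policy)
        # Mutate the action chosen in one randomly selected state.
        candidate[rng.randrange(N_STATES)] = rng.randrange(N_ACTIONS)
        candidate_score = estimate_win_rate(candidate, games, rng)
        if candidate_score > score:   # keep only apparent improvements
            policy, score = candidate, candidate_score
        history.append(score)
    return policy, history

policy, history = hill_climb(iterations=50, games=200)
```

Note that `estimate_win_rate` sees only wins and losses. Nothing about which states were visited or which moves were made during a game reaches the search.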


Here is how the tic-tac-toe problem would be approached with a method making use of a value 
function. First we set up a table of numbers, one for each possible state of the game. Each number will 
be the latest estimate of the probability of our winning from that state. We treat this estimate as the 
state’s value, and the whole table is the learned value function. State A has higher value than state B, 
or is considered “better” than state B, if the current estimate of the probability of our winning from A 
is higher than it is from B. Assuming we always play Xs, then for all states with three Xs in a row the 
probability of winning is 1, because we have already won. Similarly, for all states with three Os in a 
row, or that are “filled up,” the correct probability is 0, as we cannot win from them. We set the initial 
values of all the other states to 0.5, representing a guess that we have a 50% chance of winning. 


We play many games against the opponent. To select our moves we examine the states that would 
result from each of our possible moves (one for each blank space on the board) and look up their current 
values in the table. Most of the time we move greedily, selecting the move that leads to the state with 
greatest value, that is, with the highest estimated probability of winning. Occasionally, however, we 
select randomly from among the other moves instead. These are called exploratory moves because 
they cause us to experience states that we might otherwise never see. A sequence of moves made and 
considered during a game can be diagrammed as in Figure 1.1. 
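
The table setup and move selection just described might be sketched as below. This is one possible rendering under assumptions of our own: the board is encoded as a nine-character string of "X", "O", and " " read row by row, table entries are created lazily on first lookup rather than all at once, and the exploration probability `epsilon` and the function names are hypothetical.

```python
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

def has_three(board, mark):
    """True if `mark` occupies all three squares of some line."""
    return any(all(board[i] == mark for i in line) for line in WIN_LINES)

def initial_value(board):
    """Initial estimate of our (X's) probability of winning from `board`."""
    if has_three(board, "X"):
        return 1.0                      # we have already won
    if has_three(board, "O") or " " not in board:
        return 0.0                      # lost, or filled up without a win
    return 0.5                          # initial guess for all other states

def value_of(table, board):
    """Look up a state's value, creating the entry on first use."""
    if board not in table:
        table[board] = initial_value(board)
    return table[board]

def choose_move(table, board, epsilon, rng):
    """Return (square, was_exploratory) for the X player."""
    blanks = [i for i, mark in enumerate(board) if mark == " "]

    def after(i):
        # State that would result from placing an X on square i.
        return board[:i] + "X" + board[i + 1:]

    greedy = max(blanks, key=lambda i: value_of(table, after(i)))
    others = [i for i in blanks if i != greedy]
    if others and rng.random() < epsilon:
        return rng.choice(others), True     # exploratory move
    return greedy, False                    # greedy move

table = {}
move, exploratory = choose_move(table, "XX OO    ", 0.0, random.Random(0))
# With epsilon 0.0 the move is greedy: square 2 completes the top row.
```

Ties among equally valued moves are broken here by taking the first such square, which is an arbitrary choice. Breaking ties randomly would be an equally reasonable alternative.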


[Figure 1.1 diagram: a game tree descending from the starting position, with levels alternating between the opponent's move and our move.] 



Figure 1.1: A sequence of tic-tac-toe moves. The solid lines represent the moves taken during a game; the 
dashed lines represent moves that we (our reinforcement learning player) considered but did not make. Our 
second move was an exploratory move, meaning that it was taken even though another sibling move, the one 
leading to e*, was ranked higher. Exploratory moves do not result in any learning, but each of our other moves 
does, causing updates as suggested by the curved arrow in which estimated values are moved up the tree from 
later nodes to earlier as detailed in the text. 


While we are playing, we change the values of the states in which we find ourselves during the game. 
We attempt to make them more accurate estimates of the probabilities of winning. To do this, we “back 
up” the value of the state after each greedy move to the state before the move, as suggested by the 
arrows in Figure 1.1. More precisely, the current value of the earlier state is updated to be closer to 
the value of the later state. This can be done by moving the earlier state’s value a fraction of the way 
toward the value of the later state. If we let s denote the state before the greedy move, and s' the state 
after the move, then the update to the estimated value of s, denoted $V(s)$, can be written as 

$$V(s) \leftarrow V(s) + \alpha \big[ V(s') - V(s) \big]$$

where $\alpha$ is a small positive fraction called the step-size parameter, which influences the rate of learning. 
This update rule is an example of a temporal-difference learning method, so called because its changes 
are based on a difference, $V(s') - V(s)$, between estimates at two different times. 
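
As a small worked instance of this update rule, the sketch below applies it to a two-entry table. The dictionary representation, the state labels "before" and "after", and the step size 0.1 are assumptions chosen for illustration.

```python
def td_update(values, s, s_next, alpha):
    """Move the estimate for s a fraction alpha of the way toward the estimate for s_next."""
    values[s] += alpha * (values[s_next] - values[s])
    return values[s]

# The earlier state starts at the default guess 0.5; the later state is a win (1.0).
values = {"before": 0.5, "after": 1.0}
new_value = td_update(values, "before", "after", alpha=0.1)
# new_value is approximately 0.55: one tenth of the way from 0.5 toward 1.0.
```

Repeating the update with the later state's value held fixed moves the earlier estimate geometrically closer to it, with the remaining gap shrinking by a factor of (1 - alpha) each time.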

The method described above performs quite well on this task. For example, if the step-size parameter 
is reduced properly over time, then this method converges, for any fixed opponent, to the true 
probabilities of winning from each state given optimal play by our player. Furthermore, the moves then 
taken (except on exploratory moves) are in fact the optimal moves against this (imperfect) opponent. 
In other words, the method converges to an optimal policy for playing the game against this opponent. 
If the step-size parameter is not reduced all the way to zero over time, then this player also plays well 
against opponents that slowly change their way of playing. 

This example illustrates the differences between evolutionary methods and methods that learn value 
functions. To evaluate a policy an evolutionary method holds the policy fixed and plays many games 
against the opponent, or simulates many games using a model of the opponent. The frequency of wins 
gives an unbiased estimate of the probability of winning with that policy, and can be used to direct 
the next policy selection. But each policy change is made only after many games, and only the final 
outcome of each game is used: what happens during the games is ignored. For example, if the player 
wins, then all of its behavior in the game is given credit, independently of how specific moves might have 
been critical to the win. Credit is even given to moves that never occurred! Value function methods, in 
contrast, allow individual states to be evaluated. In the end, evolutionary and value function methods 
both search the space of policies, but learning a value function takes advantage of information available 
during the course of play. 

This simple example illustrates some of the key features of reinforcement learning methods. First, 
there is the emphasis on learning while interacting with an environment, in this case with an opponent 
player. Second, there is a clear goal, and correct behavior requires planning or foresight that takes into 
account delayed effects of one’s choices. For example, the simple reinforcement learning player would 
learn to set up multi-move traps for a shortsighted opponent. It is a striking feature of the reinforcement 
learning solution that it can achieve the effects of planning and lookahead without using a model of 
the opponent and without conducting an explicit search over possible sequences of future states and 
actions. 

While this example illustrates some of the key features of reinforcement learning, it is so simple that 
it might give the impression that reinforcement learning is more limited than it really is. Although 
tic-tac-toe is a two-person game, reinforcement learning also applies in the case in which there is no external 
adversary, that is, in the case of a “game against nature.” Reinforcement learning also is not restricted 
to problems in which behavior breaks down into separate episodes, like the separate games of tic-tac-toe, 
with reward only at the end of each episode. It is just as applicable when behavior continues indefinitely 
and when rewards of various magnitudes can be received at any time. Reinforcement learning is also 
applicable to problems that do not even break down into discrete time steps, like the plays of 
tic-tac-toe. The general principles apply to continuous-time problems as well, although the theory gets more 
complicated and we omit it from this introductory treatment. 

Tic-tac-toe has a relatively small, finite state set, whereas reinforcement learning can be used when 
the state set is very large, or even infinite. For example, Gerry Tesauro (1992, 1995) combined the 
algorithm described above with an artificial neural network to learn to play backgammon, which has 
approximately $10^{20}$ states. With this many states it is impossible ever to experience more than a small 
fraction of them. Tesauro’s program learned to play far better than any previous program, and now 
plays at the level of the world’s best human players (see Chapter 16). The neural network provides 
the program with the ability to generalize from its experience, so that in new states it selects moves 
based on information saved from similar states faced in the past, as determined by its network. How 
well a reinforcement learning system can work in problems with such large state sets is intimately tied 
to how appropriately it can generalize from past experience. It is in this role that we have the greatest 
need for supervised learning methods with reinforcement learning. Neural networks and deep learning 
(Section 9.6) are not the only, or necessarily the best, way to do this. 

In this tic-tac-toe example, learning started with no prior knowledge beyond the rules of the game, 
but reinforcement learning by no means entails a tabula rasa view of learning and intelligence. On the 
contrary, prior information can be incorporated into reinforcement learning in a variety of ways that 
can be critical for efficient learning. We also had access to the true state in the tic-tac-toe example, 
whereas reinforcement learning can also be applied when part of the state is hidden, or when different 
states appear to the learner to be the same. 

Finally, the tic-tac-toe player was able to look ahead and know the states that would result from each 
of its possible moves. To do this, it had to have a model of the game that allowed it to “think about” 
how its environment would change in response to moves that it might never make. Many problems 
are like this, but in others even a short-term model of the effects of actions is lacking. Reinforcement 
learning can be applied in either case. No model is required, but models can easily be used if they are 
available or can be learned (Chapter 8). 

On the other hand, there are reinforcement learning methods that do not need any kind of environment 
model at all. Model-free systems cannot even think about how their environments will change in 
response to a single action. The tic-tac-toe player is model-free in this sense with respect to its opponent: 
it has no model of its opponent of any kind. Because models have to be reasonably accurate to be 
useful, model-free methods can have advantages over more complex methods when the real bottleneck in 
solving a problem is the difficulty of constructing a sufficiently accurate environment model. Model-free 
methods are also important building blocks for model-based methods. In this book we devote several 
chapters to model-free methods before we discuss how they can be used as components of more complex 
model-based methods. 

Reinforcement learning can be used at both high and low levels in a system. Although the tic-tac- 
toe player learned only about the basic moves of the game, nothing prevents reinforcement learning 
from working at higher levels where each of the “actions” may itself be the application of a possibly 
elaborate problem-solving method. In hierarchical learning systems, reinforcement learning can work 
simultaneously on several levels. 

Exercise 1.1: Self-Play Suppose, instead of playing against a random opponent, the reinforcement 
learning algorithm described above played against itself, with both sides learning. What do you think 
would happen in this case? Would it learn a different policy for selecting moves? □ 

Exercise 1.2: Symmetries Many tic-tac-toe positions appear different but are really the same because 
of symmetries. How might we amend the learning process described above to take advantage of this? 
In what ways would this change improve the learning process? Now think again. Suppose the opponent 
did not take advantage of symmetries. In that case, should we? Is it true, then, that symmetrically 
equivalent positions should necessarily have the same value? □ 

Exercise 1.3: Greedy Play Suppose the reinforcement learning player was greedy, that is, it always 
played the move that brought it to the position that it rated the best. Might it learn to play better, or 
worse, than a nongreedy player? What problems might occur? □ 

Exercise 1.4: Learning from Exploration Suppose learning updates occurred after all moves, including 
exploratory moves. If the step-size parameter is appropriately reduced over time (but not the tendency 
to explore), then the state values would converge to a set of probabilities. What are the two sets of 
probabilities computed when we do, and when we do not, learn from exploratory moves? Assuming 
that we do continue to make exploratory moves, which set of probabilities might be better to learn? 
Which would result in more wins? □ 

Exercise 1.5: Other Improvements Can you think of other ways to improve the reinforcement learning 
player? Can you think of any better way to solve the tic-tac-toe problem as posed? □ 


1.6 Summary 

Reinforcement learning is a computational approach to understanding and automating goal-directed 
learning and decision making. It is distinguished from other computational approaches by its emphasis 
on learning by an agent from direct interaction with its environment, without relying on exemplary 
supervision or complete models of the environment. In our opinion, reinforcement learning is the first 
field to seriously address the computational issues that arise when learning from interaction with an 
environment in order to achieve long-term goals. 

Reinforcement learning uses the formal framework of Markov decision processes to define the interaction 
between a learning agent and its environment in terms of states, actions, and rewards. This 
framework is intended to be a simple way of representing essential features of the artificial intelligence 
problem. These features include a sense of cause and effect, a sense of uncertainty and nondeterminism, 
and the existence of explicit goals. 

The concepts of value and value functions are the key features of most of the reinforcement learning 
methods that we consider in this book. We take the position that value functions are important for 
efficient search in the space of policies. The use of value functions distinguishes reinforcement learning 
methods from evolutionary methods that search directly in policy space guided by scalar evaluations of 
entire policies. 


1.7 Early History of Reinforcement Learning 

The early history of reinforcement learning has two main threads, both long and rich, that were pursued 
independently before intertwining in modern reinforcement learning. One thread concerns learning by 
trial and error that started in the psychology of animal learning. This thread runs through some of 
the earliest work in artificial intelligence and led to the revival of reinforcement learning in the early 
1980s. The other thread concerns the problem of optimal control and its solution using value functions 
and dynamic programming. For the most part, this thread did not involve learning. Although the 
two threads have been largely independent, the exceptions revolve around a third, less distinct thread 
concerning temporal-difference methods such as the one used in the tic-tac-toe example in this chapter. 
All three threads came together in the late 1980s to produce the modern field of reinforcement learning 
as we present it in this book. 

The thread focusing on trial-and-error learning is the one with which we are most familiar and about 
which we have the most to say in this brief history. Before doing that, however, we briefly discuss the 
optimal control thread. 

The term “optimal control” came into use in the late 1950s to describe the problem of designing a 
controller to minimize a measure of a dynamical system’s behavior over time. One of the approaches 
to this problem was developed in the mid-1950s by Richard Bellman and others through extending a 
nineteenth century theory of Hamilton and Jacobi. This approach uses the concepts of a dynamical 
system’s state and of a value function, or “optimal return function,” to define a functional equation, now 
often called the Bellman equation. The class of methods for solving optimal control problems by solving 
this equation came to be known as dynamic programming (Bellman, 1957a). Bellman (1957b) also 
introduced the discrete stochastic version of the optimal control problem known as Markovian decision 
processes (MDPs), and Ronald Howard (1960) devised the policy iteration method for MDPs. All of 
these are essential elements underlying the theory and algorithms of modern reinforcement learning. 

Dynamic programming is widely considered the only feasible way of solving general stochastic optimal 
control problems. It suffers from what Bellman called “the curse of dimensionality,” meaning that its 
computational requirements grow exponentially with the number of state variables, but it is still far more 
efficient and more widely applicable than any other general method. Dynamic programming has been 
extensively developed since the late 1950s, including extensions to partially observable MDPs (surveyed 
by Lovejoy, 1991), many applications (surveyed by White, 1985, 1988, 1993), approximation methods 
(surveyed by Rust, 1996), and asynchronous methods (Bertsekas, 1982, 1983). Many excellent modern 
treatments of dynamic programming are available (e.g., Bertsekas, 2005, 2012; Puterman, 1994; Ross, 
1983; and Whittle, 1982, 1983). Bryson (1996) provides an authoritative history of optimal control. 

Connections between optimal control and dynamic programming, on the one hand, and learning, on 
the other, were slow to be recognized. We cannot be sure about what accounted for this separation, 
but its main cause was likely the separation between the disciplines involved and their different goals. 
Also contributing may have been the prevalent view of dynamic programming as an off-line computation 
depending essentially on accurate system models and analytic solutions to the Bellman equation. 
Further, the simplest form of dynamic programming is a computation that proceeds backwards in time, 
making it difficult to see how it could be involved in a learning process that must proceed in a forward 
direction. Some of the earliest work in dynamic programming, such as that by Bellman and Dreyfus 
(1959), might now be classified as following a learning approach. Witten’s (1977) work (discussed below) 
certainly qualifies as a combination of learning and dynamic-programming ideas. Werbos (1987) 
argued explicitly for greater interrelation of dynamic programming and learning methods and for its relevance 
to understanding neural and cognitive mechanisms. For us the full integration of dynamic programming 
methods with on-line learning did not occur until the work of Chris Watkins in 1989, whose treatment 
of reinforcement learning using the MDP formalism has been widely adopted (Watkins, 1989). Since 
then these relationships have been extensively developed by many researchers, most particularly by 
Dimitri Bertsekas and John Tsitsiklis (1996), who coined the term “neurodynamic programming” to 
refer to the combination of dynamic programming and neural networks. Another term currently in 
use is “approximate dynamic programming.” These various approaches emphasize different aspects of 
the subject, but they all share with reinforcement learning an interest in circumventing the classical 
shortcomings of dynamic programming. 

We would consider all of the work in optimal control also to be, in a sense, work in reinforcement learning. 
We define a reinforcement learning method as any effective way of solving reinforcement learning 
problems, and it is now clear that these problems are closely related to optimal control problems, 
particularly stochastic optimal control problems such as those formulated as MDPs. Accordingly, we must 
consider the solution methods of optimal control, such as dynamic programming, also to be reinforcement 
learning methods. Because almost all of the conventional methods require complete knowledge 
of the system to be controlled, it feels a little unnatural to say that they are part of reinforcement 
learning. On the other hand, many dynamic programming algorithms are incremental and iterative. 
Like learning methods, they gradually reach the correct answer through successive approximations. As 
we show in the rest of this book, these similarities are far more than superficial. The theories and 
solution methods for the cases of complete and incomplete knowledge are so closely related that we feel 
they must be considered together as part of the same subject matter. 

Let us return now to the other major thread leading to the modern field of reinforcement learning, 
that centered on the idea of trial-and-error learning. We only touch on the major points of contact 
here, taking up this topic in more detail in Chapter 14. According to American psychologist R. S. 
Woodworth, the idea of trial-and-error learning goes as far back as the 1850s to Alexander Bain’s 
discussion of learning by “groping and experiment” and more explicitly to the British ethologist and 
psychologist Conway Lloyd Morgan’s 1894 use of the term to describe his observations of animal behavior 
(Woodworth, 1938). Perhaps the first to succinctly express the essence of trial-and-error learning as a 
principle of learning was Edward Thorndike: 

Of several responses made to the same situation, those which are accompanied or closely 
followed by satisfaction to the animal will, other things being equal, be more firmly connected 
with the situation, so that, when it recurs, they will be more likely to recur; those which are 
accompanied or closely followed by discomfort to the animal will, other things being equal, 
have their connections with that situation weakened, so that, when it recurs, they will be 
less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening 
or weakening of the bond. (Thorndike, 1911, p. 244) 

Thorndike called this the “Law of Effect” because it describes the effect of reinforcing events on the 
tendency to select actions. Thorndike later modified the law to better account for accumulating data 
on animal learning (such as differences between the effects of reward and punishment), and the law in 
its various forms has generated considerable controversy among learning theorists (e.g., see Gallistel, 
2005; Herrnstein, 1970; Kimble, 1961, 1967; Mazur, 1994). Despite this, the Law of Effect, in one form 
or another, is widely regarded as a basic principle underlying much behavior (e.g., Hilgard and Bower, 
1975; Dennett, 1978; Campbell, 1960; Cziko, 1995). It is the basis of the influential learning theories of 
Clark Hull and experimental methods of B. F. Skinner (e.g., Hull, 1943; Skinner, 1938). 

The term “reinforcement” in the context of animal learning came into use well after Thorndike’s 
expression of the Law of Effect, to the best of our knowledge first appearing in this context in the 1927 
English translation of Pavlov’s monograph on conditioned reflexes. Reinforcement is the strengthening 
of a pattern of behavior as a result of an animal receiving a stimulus—a reinforcer—in an appropriate 
temporal relationship with another stimulus or with a response. Some psychologists extended its 
meaning to include the process of weakening in addition to strengthening, as well as to cases in which the 
omission or termination of an event changes behavior. Reinforcement produces changes in behavior 
that persist after the reinforcer is withdrawn, so that a stimulus that attracts an animal’s attention or 
that energizes its behavior without producing lasting changes is not considered to be a reinforcer. 

The idea of implementing trial-and-error learning in a computer appeared among the earliest thoughts 
about the possibility of artificial intelligence. In a 1948 report, Alan Turing described a design for a 
“pleasure-pain system” that worked along the lines of the Law of Effect: 

When a configuration is reached for which the action is undetermined, a random choice for 
the missing data is made and the appropriate entry is made in the description, tentatively, 
and is applied. When a pain stimulus occurs all tentative entries are cancelled, and when a 
pleasure stimulus occurs they are all made permanent. (Turing, 1948) 

Many ingenious electro-mechanical machines were constructed that demonstrated trial-and-error learning. 
The earliest may have been a machine built by Thomas Ross (1933) that was able to find its way 
through a simple maze and remember the path through the settings of switches. In 1951 W. Grey 
Walter, already known for his “mechanical tortoise” (Walter, 1950), built a version capable of a simple 
form of learning (Walter, 1951). In 1952 Claude Shannon demonstrated a maze-running mouse named 
Theseus that used trial and error to find its way through a maze, with the maze itself remembering the 
successful directions via magnets and relays under its floor (Shannon, 1951, 1952). J. A. Deutsch (1954) 
described a maze-solving machine based on his behavior theory (Deutsch, 1953) that has some properties 
in common with model-based reinforcement learning (Chapter 8). In his Ph.D. dissertation, Marvin 
Minsky (1954) discussed computational models of reinforcement learning and described his construction 
of an analog machine composed of components he called SNARCs (Stochastic Neural-Analog Reinforcement 
Calculators) meant to resemble modifiable synaptic connections in the brain (Chapter 15). The 
fascinating web site cyberneticzoo.com contains a wealth of information on these and many other 
electro-mechanical learning machines. 

Building electro-mechanical learning machines gave way to programming digital computers to perform 
various types of learning, some of which implemented trial-and-error learning. Farley and Clark (1954) 
described a digital simulation of a neural-network learning machine that learned by trial and error. But 
their interests soon shifted from trial-and-error learning to generalization and pattern recognition, that 
is, from reinforcement learning to supervised learning (Clark and Farley, 1955). This began a pattern of 
confusion about the relationship between these types of learning. Many researchers seemed to believe 
that they were studying reinforcement learning when they were actually studying supervised learning. 
For example, neural network pioneers such as Rosenblatt (1962) and Widrow and Hoff (1960) were 
clearly motivated by reinforcement learning—they used the language of rewards and punishments— 
but the systems they studied were supervised learning systems suitable for pattern recognition and 
perceptual learning. Even today, some researchers and textbooks minimize or blur the distinction 
between these types of learning. For example, some neural-network textbooks have used the term 
“trial-and-error” to describe networks that learn from training examples. This is an understandable 
confusion because these networks use error information to update connection weights, but this misses 
the essential character of trial-and-error learning as selecting actions on the basis of evaluative feedback 
that does not rely on knowledge of what the correct action should be. 

Partly as a result of these confusions, research into genuine trial-and-error learning became rare in 
the 1960s and 1970s, although there were notable exceptions. In the 1960s the terms “reinforcement” 
and “reinforcement learning” were used in the engineering literature for the first time to describe 
engineering uses of trial-and-error learning (e.g., Waltz and Fu, 1965; Mendel, 1966; Fu, 1970; Mendel 
and McClaren, 1970). Particularly influential was Minsky’s paper “Steps Toward Artificial Intelligence” 
(Minsky, 1961), which discussed several issues relevant to trial-and-error learning, including prediction, 
expectation, and what he called the basic credit-assignment problem for complex reinforcement learning 
systems: How do you distribute credit for success among the many decisions that may have been 
involved in producing it? All of the methods we discuss in this book are, in a sense, directed toward 
solving this problem. Minsky’s paper is well worth reading today. 

In the next few paragraphs we discuss some of the other exceptions and partial exceptions to the 
relative neglect of computational and theoretical study of genuine trial-and-error learning in the 1960s 
and 1970s. 

One of these was the work by a New Zealand researcher named John Andreae. Andreae (1963) 
developed a system called STeLLA that learned by trial and error in interaction with its environment. 
This system included an internal model of the world and, later, an “internal monologue” to deal with 
problems of hidden state (Andreae, 1969a). Andreae’s later work (1977) placed more emphasis on 
learning from a teacher, but still included learning by trial and error, with the generation of novel 
events being one of the system’s goals. A feature of this work was a “leakback process,” elaborated 
more fully in Andreae (1998), that implemented a credit-assignment mechanism similar to the backing- 
up update operations that we describe. Unfortunately, his pioneering research was not well known, and 
did not greatly impact subsequent reinforcement learning research. 

More influential was the work of Donald Michie. In 1961 and 1963 he described a simple trial-and- 
error learning system for learning how to play tic-tac-toe (or naughts and crosses) called MENACE (for 
Matchbox Educable Naughts and Crosses Engine). It consisted of a matchbox for each possible game 
position, each matchbox containing a number of colored beads, a different color for each possible move 
from that position. By drawing a bead at random from the matchbox corresponding to the current 
game position, one could determine MENACE’s move. When a game was over, beads were added 
to or removed from the boxes used during play to reinforce or punish MENACE’s decisions. Michie 
and Chambers (1968) described another tic-tac-toe reinforcement learner called GLEE (Game Learning 
Expectimaxing Engine) and a reinforcement learning controller called BOXES. They applied BOXES to 
the task of learning to balance a pole hinged to a movable cart on the basis of a failure signal occurring 
only when the pole fell or the cart reached the end of a track. This task was adapted from the earlier 
work of Widrow and Smith (1964), who used supervised learning methods, assuming instruction from 
a teacher already able to balance the pole. Michie and Chambers’s version of pole-balancing is one of 
the best early examples of a reinforcement learning task under conditions of incomplete knowledge. It 
influenced much later work in reinforcement learning, beginning with some of our own studies (Barto, 
Sutton, and Anderson, 1983; Sutton, 1984). Michie consistently emphasized the role of trial and error 
and learning as essential aspects of artificial intelligence (Michie, 1974). 
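
The matchbox scheme described above for MENACE can be sketched in a few lines of Python. This is only an illustration of the idea of drawing a bead at random and then adding or removing beads after the game. The class name Menace, the starting count of three beads per move, the one-bead reward and penalty, and the uniform fallback for an empty box are all assumptions of this sketch, not details taken from Michie's papers.

```python
import random


class Menace:
    """Hypothetical matchbox learner: one box of beads per game position."""

    def __init__(self, initial_beads=3, rng=None):
        self.initial_beads = initial_beads   # assumed starting beads per move
        self.rng = rng or random.Random()
        self.boxes = {}      # position -> {move: bead count}
        self.history = []    # (position, move) pairs used in the current game

    def choose(self, position, legal_moves):
        # Open the matchbox for this position, creating it on first visit.
        box = self.boxes.setdefault(
            position, {move: self.initial_beads for move in legal_moves})
        moves = list(box)
        weights = [box[move] for move in moves]
        if sum(weights) == 0:
            # Empty box: fall back to a uniform choice (an assumption here).
            move = self.rng.choice(moves)
        else:
            # Drawing one bead at random is a draw weighted by bead counts.
            move = self.rng.choices(moves, weights=weights, k=1)[0]
        self.history.append((position, move))
        return move

    def finish_game(self, won, reward=1, penalty=1):
        # Reinforce or punish every decision made during the finished game.
        for position, move in self.history:
            box = self.boxes[position]
            if won:
                box[move] += reward
            else:
                box[move] = max(0, box[move] - penalty)
        self.history.clear()
```

A game loop would call choose once per move with a hashable encoding of the board and then call finish_game once with the outcome. Moves that took part in wins accumulate beads and so are drawn more often in later games.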

Widrow, Gupta, and Maitra (1973) modified the Least-Mean-Square (LMS) algorithm of Widrow and 
Hoff (1960) to produce a reinforcement learning rule that could learn from success and failure signals 
instead of from training examples. They called this form of learning “selective bootstrap adaptation” 
and described it as “learning with a critic” instead of “learning with a teacher.” They analyzed this rule 
and showed how it could learn to play blackjack. This was an isolated foray into reinforcement learning 
by Widrow, whose contributions to supervised learning were much more influential. Our use of the term 
“critic” is derived from Widrow, Gupta, and Maitra’s paper. Buchanan, Mitchell, Smith, and Johnson 
(1978) independently used the term critic in the context of machine learning (see also Dietterich and 
Buchanan, 1984), but for them a critic is an expert system able to do more than evaluate performance. 

Research on learning automata had a more direct influence on the trial-and-error thread leading 
to modern reinforcement learning research. These are methods for solving a nonassociative, purely 
selectional learning problem known as the k-armed bandit by analogy to a slot machine, or “one-armed 
bandit,” except with k levers (see Chapter 2). Learning automata are simple, low-memory machines for 
improving the probability of reward in these problems. Learning automata originated with work in the 
1960s of the Russian mathematician and physicist M. L. Tsetlin and colleagues (published posthumously 
in Tsetlin, 1973) and have been extensively developed since then within engineering (see Narendra and 
Thathachar, 1974, 1989). These developments included the study of stochastic learning automata, 
which are methods for updating action probabilities on the basis of reward signals. Stochastic learning 
automata were foreshadowed by earlier work in psychology, beginning with William Estes’ 1950 effort 
toward a statistical theory of learning (Estes, 1950) and further developed by others, most famously by 
psychologist Robert Bush and statistician Frederick Mosteller (Bush and Mosteller, 1955). 

The statistical learning theories developed in psychology were adopted by researchers in economics, 
leading to a thread of research in that field devoted to reinforcement learning. This work began in 1973 
with the application of Bush and Mosteller’s learning theory to a collection of classical economic models 
(Cross, 1973). One goal of this research was to study artificial agents that act more like real people 
than do traditional idealized economic agents (Arthur, 1991). This approach expanded to the study of 
reinforcement learning in the context of game theory. Although reinforcement learning in economics 
developed largely independently of the early work in artificial intelligence, reinforcement learning and 
game theory is a topic of current interest in both fields, but one that is beyond the scope of this book. 
Camerer (2003) discusses the reinforcement learning tradition in economics, and Nowe et al. (2012) 
provide an overview of the subject from the point of view of multi-agent extensions to the approach 
that we introduce in this book. Reinforcement in the context of game theory is a much different subject 
than reinforcement learning used in programs to play tic-tac-toe, checkers, and other recreational games. 
See, for example, Szita (2012) for an overview of this aspect of reinforcement learning and games. 

John Holland (1975) outlined a general theory of adaptive systems based on selectional principles. 
His early work concerned trial and error primarily in its nonassociative form, as in evolutionary methods 
and the k-armed bandit. In 1976 and more fully in 1986, he introduced classifier systems, true reinforcement 
learning systems including association and value functions. A key component of Holland’s 
classifier systems was the “bucket-brigade algorithm” for credit assignment that is closely related to the 
temporal difference algorithm used in our tic-tac-toe example and discussed in Chapter 6. Another key 
component was a genetic algorithm, an evolutionary method whose role was to evolve useful representations. 
Classifier systems have been extensively developed by many researchers to form a major branch 
of reinforcement learning research (reviewed by Urbanowicz and Moore, 2009), but genetic algorithms— 
which we do not consider to be reinforcement learning systems by themselves—have received much more 
attention, as have other approaches to evolutionary computation (e.g., Fogel, Owens and Walsh, 1966, 
and Koza, 1992). 

The individual most responsible for reviving the trial-and-error thread to reinforcement learning 
within artificial intelligence was Harry Klopf (1972, 1975, 1982). Klopf recognized that essential aspects 
of adaptive behavior were being lost as learning researchers came to focus almost exclusively on 
supervised learning. What was missing, according to Klopf, were the hedonic aspects of behavior, the 
drive to achieve some result from the environment, to control the environment toward desired ends and 
away from undesired ends. This is the essential idea of trial-and-error learning. Klopf’s ideas were 
especially influential on the authors because our assessment of them (Barto and Sutton, 1981a) led to 
our appreciation of the distinction between supervised and reinforcement learning, and to our eventual 
focus on reinforcement learning. Much of the early work that we and colleagues accomplished was directed 
toward showing that reinforcement learning and supervised learning were indeed different (Barto, 
Sutton, and Brouwer, 1981; Barto and Sutton, 1981b; Barto and Anandan, 1985). Other studies showed 
how reinforcement learning could address important problems in neural network learning, in particular, 
how it could produce learning algorithms for multilayer networks (Barto, Anderson, and Sutton, 1982; 
Barto and Anderson, 1985; Barto and Anandan, 1985; Barto, 1985, 1986; Barto and Jordan, 1987). We 
say more about reinforcement learning and neural networks in Chapter 15. 

We turn now to the third thread to the history of reinforcement learning, that concerning temporal-difference 
learning. Temporal-difference learning methods are distinctive in being driven by the difference 
between temporally successive estimates of the same quantity—for example, of the probability of 
winning in the tic-tac-toe example. This thread is smaller and less distinct than the other two, but it 
has played a particularly important role in the field, in part because temporal-difference methods seem 
to be new and unique to reinforcement learning. 
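
As a minimal sketch of what "driven by the difference between temporally successive estimates" means, the following Python fragment moves an earlier estimate a fraction of the way toward a later one, in the spirit of the tic-tac-toe example. The function name td_update, the dictionary of state values, the state labels, and the step size of 0.1 are illustrative assumptions, not notation from this book.

```python
def td_update(values, state, next_state, step_size=0.1):
    """Nudge the estimate for `state` toward the later estimate for `next_state`.

    `values` maps each state to its current estimate, for example an
    estimated probability of winning. Returns the temporal difference.
    """
    td_error = values[next_state] - values[state]   # later minus earlier estimate
    values[state] += step_size * td_error
    return td_error


# Hypothetical three-state sequence ending in a win (value fixed at 1.0).
values = {"early": 0.5, "late": 0.5, "won": 1.0}
td_update(values, "late", "won")     # "late" rises from 0.5 to about 0.55
td_update(values, "early", "late")   # "early" rises from 0.5 to about 0.505
```

No target other than the next estimate is needed: information about the final outcome reaches earlier states only through repeated updates of this kind.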

The origins of temporal-difference learning are in part in animal learning psychology, in particular, 
in the notion of secondary reinforcers. A secondary reinforcer is a stimulus that has been paired with 
a primary reinforcer such as food or pain and, as a result, has come to take on similar reinforcing 
properties. Minsky (1954) may have been the first to realize that this psychological principle could be 
important for artificial learning systems. Arthur Samuel (1959) was the first to propose and implement 
a learning method that included temporal-difference ideas, as part of his celebrated checkers-playing 
program. 

Samuel made no reference to Minsky’s work or to possible connections to animal learning. His inspiration 
apparently came from Claude Shannon’s (1950) suggestion that a computer could be programmed 
to use an evaluation function to play chess, and that it might be able to improve its play by modifying 
this function on-line. (It is possible that these ideas of Shannon’s also influenced Bellman, but we 
know of no evidence for this.) Minsky (1961) extensively discussed Samuel’s work in his “Steps” paper, 
suggesting the connection to secondary reinforcement theories, both natural and artificial. 

As we have discussed, in the decade following the work of Minsky and Samuel, little computational 
work was done on trial-and-error learning, and apparently no computational work at all was done on 
temporal-difference learning. In 1972, Klopf brought trial-and-error learning together with an important 
component of temporal-difference learning. Klopf was interested in principles that would scale to 
learning in large systems, and thus was intrigued by notions of local reinforcement, whereby subcomponents 
of an overall learning system could reinforce one another. He developed the idea of “generalized 
reinforcement,” whereby every component (nominally, every neuron) views all of its inputs in reinforcement 
terms: excitatory inputs as rewards and inhibitory inputs as punishments. This is not the same 
idea as what we now know as temporal-difference learning, and in retrospect it is farther from it than 
was Samuel’s work. On the other hand, Klopf linked the idea with trial-and-error learning and related 
it to the massive empirical database of animal learning psychology. 

Sutton (1978a, 1978b, 1978c) developed Klopf’s ideas further, particularly the links to animal learning 
theories, describing learning rules driven by changes in temporally successive predictions. He and Barto 
refined these ideas and developed a psychological model of classical conditioning based on temporal- 
difference learning (Sutton and Barto, 1981a; Barto and Sutton, 1982). There followed several other 
influential psychological models of classical conditioning based on temporal-difference learning (e.g., 
Klopf, 1988; Moore et al., 1986; Sutton and Barto, 1987, 1990). Some neuroscience models developed
at this time are well interpreted in terms of temporal-difference learning (Hawkins and Kandel, 1984;
Byrne, Gingrich, and Baxter, 1990; Gelperin, Hopfield, and Tank, 1985; Tesauro, 1986; Friston et al.,
1994), although in most cases there was no historical connection. 

Our early work on temporal-difference learning was strongly influenced by animal learning theories 
and by Klopf’s work. Relationships to Minsky’s “Steps” paper and to Samuel’s checkers players appear 
to have been recognized only afterward. By 1981, however, we were fully aware of all the prior work 
mentioned above as part of the temporal-difference and trial-and-error threads. At this time we developed
a method for using temporal-difference learning combined with trial-and-error learning, known
as the actor-critic architecture, and applied this method to Michie and Chambers’s pole-balancing 
problem (Barto, Sutton, and Anderson, 1983). This method was extensively studied in Sutton’s (1984) 
Ph.D. dissertation and extended to use backpropagation neural networks in Anderson’s (1986) Ph.D. 
dissertation. Around this time, Holland (1986) incorporated temporal-difference ideas explicitly into
his classifier systems in the form of his bucket-brigade algorithm. A key step was taken by Sutton in
1988 by separating temporal-difference learning from control, treating it as a general prediction method.
That paper also introduced the TD(λ) algorithm and proved some of its convergence properties.

As we were finalizing our work on the actor-critic architecture in 1981, we discovered a paper by 
Ian Witten (1977) which appears to be the earliest publication of a temporal-difference learning rule. 
He proposed the method that we now call tabular TD(0) for use as part of an adaptive controller for 
solving MDPs. Witten’s work was a descendant of Andreae’s early experiments with STeLLA and 
other trial-and-error learning systems. Thus, Witten’s 1977 paper spanned both major threads of
reinforcement learning research, trial-and-error learning and optimal control, while making a distinct
early contribution to temporal-difference learning.

The temporal-difference and optimal control threads were fully brought together in 1989 with Chris 
Watkins’s development of Q-learning. This work extended and integrated prior work in all three threads 
of reinforcement learning research. Paul Werbos (1987) contributed to this integration by arguing for 
the convergence of trial-and-error learning and dynamic programming since 1977. By the time of 
Watkins’s work there had been tremendous growth in reinforcement learning research, primarily in the 
machine learning subfield of artificial intelligence, but also in neural networks and artificial intelligence 
more broadly. In 1992, the remarkable success of Gerry Tesauro’s backgammon playing program,
TD-Gammon, brought additional attention to the field.

In the time since publication of the first edition of this book, a flourishing subfield of neuroscience 
developed that focuses on the relationship between reinforcement learning algorithms and reinforcement 
learning in the nervous system. Most responsible for this is an uncanny similarity between the behavior 
of temporal-difference algorithms and the activity of dopamine producing neurons in the brain, as 
pointed out by a number of researchers (Friston et al., 1994; Barto, 1995a; Houk, Adams, and Barto,
1995; Montague, Dayan, and Sejnowski, 1996; and Schultz, Dayan, and Montague, 1997). Chapter 15 
provides an introduction to this exciting aspect of reinforcement learning. 

Other important contributions made in the recent history of reinforcement learning are too numerous 
to mention in this brief account; we cite many of these at the end of the individual chapters in which 
they arise. 


Bibliographical Remarks 

For additional general coverage of reinforcement learning, we refer the reader to the books by Szepesvari 
(2010), Bertsekas and Tsitsiklis (1996), Kaelbling (1993a), and Sugiyama et al. (2013). Books that take 
a control or operations research perspective include those of Si et al. (2004), Powell (2011), Lewis and 
Liu (2012), and Bertsekas (2012). Cao’s (2009) review places reinforcement learning in the context of 
other approaches to learning and optimization of stochastic dynamic systems. Three special issues of 
the journal Machine Learning focus on reinforcement learning: Sutton (1992), Kaelbling (1996), and 
Singh (2002). Useful surveys are provided by Barto (1995b); Kaelbling, Littman, and Moore (1996);
and Keerthi and Ravindran (1997). The volume edited by Wiering and van Otterlo (2012) provides an
excellent overview of recent developments. 

1.2 The example of Phil’s breakfast in this chapter was inspired by Agre (1988). 

1.5 The temporal-difference method used in the tic-tac-toe example is developed in Chapter 6. 



Part I: Tabular Solution Methods 


In this part of the book we describe almost all the core ideas of reinforcement learning algorithms 
in their simplest forms: that in which the state and action spaces are small enough for the approximate
value functions to be represented as arrays, or tables. In this case, the methods can often find
exact solutions, that is, they can often find exactly the optimal value function and the optimal policy. 
This contrasts with the approximate methods described in the next part of the book, which only find 
approximate solutions, but which in return can be applied effectively to much larger problems. 

The first chapter of this part of the book describes solution methods for the special case of the 
reinforcement learning problem in which there is only a single state, called bandit problems. The 
second chapter describes the general problem formulation that we treat throughout the rest of the 
book, finite Markov decision processes, and its main ideas, including Bellman equations and value
functions. 

The next three chapters describe three fundamental classes of methods for solving finite Markov 
decision problems: dynamic programming, Monte Carlo methods, and temporal-difference learning. 
Each class of methods has its strengths and weaknesses. Dynamic programming methods are well 
developed mathematically, but require a complete and accurate model of the environment. Monte 
Carlo methods don’t require a model and are conceptually simple, but are not well suited for step- 
by-step incremental computation. Finally, temporal-difference methods require no model and are fully 
incremental, but are more complex to analyze. The methods also differ in several ways with respect to 
their efficiency and speed of convergence. 

The remaining two chapters describe how these three classes of methods can be combined to obtain 
the best features of each of them. In one chapter we describe how the strengths of Monte Carlo methods 
can be combined with the strengths of temporal-difference methods via the use of eligibility traces. In 
the final chapter of this part of the book we show how temporal-difference learning methods can be 
combined with model learning and planning methods (such as dynamic programming) for a complete 
and unified solution to the tabular reinforcement learning problem. 
Chapter 2


Multi-armed Bandits 


The most important feature distinguishing reinforcement learning from other types of learning is that 
it uses training information that evaluates the actions taken rather than instructs by giving correct 
actions. This is what creates the need for active exploration, for an explicit search for good behavior. 
Purely evaluative feedback indicates how good the action taken was, but not whether it was the best or 
the worst action possible. Purely instructive feedback, on the other hand, indicates the correct action 
to take, independently of the action actually taken. This kind of feedback is the basis of supervised 
learning, which includes large parts of pattern classification, artificial neural networks, and system 
identification. In their pure forms, these two kinds of feedback are quite distinct: evaluative feedback 
depends entirely on the action taken, whereas instructive feedback is independent of the action taken. 

In this chapter we study the evaluative aspect of reinforcement learning in a simplified setting, one 
that does not involve learning to act in more than one situation. This nonassociative setting is the 
one in which most prior work involving evaluative feedback has been done, and it avoids much of the 
complexity of the full reinforcement learning problem. Studying this case enables us to see most clearly 
how evaluative feedback differs from, and yet can be combined with, instructive feedback. 

The particular nonassociative, evaluative feedback problem that we explore is a simple version of 
the k-armed bandit problem. We use this problem to introduce a number of basic learning methods
which we extend in later chapters to apply to the full reinforcement learning problem. At the end 
of this chapter, we take a step closer to the full reinforcement learning problem by discussing what 
happens when the bandit problem becomes associative, that is, when actions are taken in more than 
one situation. 


2.1 A k-armed Bandit Problem

Consider the following learning problem. You are faced repeatedly with a choice among k different options,
or actions. After each choice you receive a numerical reward chosen from a stationary probability
distribution that depends on the action you selected. Your objective is to maximize the expected total 
reward over some time period, for example, over 1000 action selections, or time steps. 

This is the original form of the k-armed bandit problem, so named by analogy to a slot machine, or 
“one-armed bandit,” except that it has k levers instead of one. Each action selection is like a play of one 
of the slot machine’s levers, and the rewards are the payoffs for hitting the jackpot. Through repeated 
action selections you are to maximize your winnings by concentrating your actions on the best levers. 
Another analogy is that of a doctor choosing between experimental treatments for a series of seriously 
ill patients. Each action is the selection of a treatment, and each reward is the survival or well-being 
of the patient. Today the term “bandit problem” is sometimes used for a generalization of the problem
described above, but in this book we use it to refer just to this simple case. 

In our k-armed bandit problem, each of the k actions has an expected or mean reward given that
that action is selected; let us call this the value of that action. We denote the action selected on time
step t as $A_t$, and the corresponding reward as $R_t$. The value then of an arbitrary action a, denoted
q*(a), is the expected reward given that a is selected:

\[ q_*(a) = \mathbb{E}[R_t \mid A_t = a]. \]

If you knew the value of each action, then it would be trivial to solve the k-armed bandit problem: you
would always select the action with highest value. We assume that you do not know the action values
with certainty, although you may have estimates. We denote the estimated value of action a at time
step t as $Q_t(a)$. We would like $Q_t(a)$ to be close to q*(a).

If you maintain estimates of the action values, then at any time step there is at least one action whose 
estimated value is greatest. We call these the greedy actions. When you select one of these actions, 
we say that you are exploiting your current knowledge of the values of the actions. If instead you 
select one of the nongreedy actions, then we say you are exploring, because this enables you to improve
your estimate of the nongreedy action’s value. Exploitation is the right thing to do to maximize the 
expected reward on the one step, but exploration may produce the greater total reward in the long run. 
For example, suppose a greedy action’s value is known with certainty, while several other actions are 
estimated to be nearly as good but with substantial uncertainty. The uncertainty is such that at least 
one of these other actions probably is actually better than the greedy action, but you don’t know which 
one. If you have many time steps ahead on which to make action selections, then it may be better to 
explore the nongreedy actions and discover which of them are better than the greedy action. Reward is 
lower in the short run, during exploration, but higher in the long run because after you have discovered 
the better actions, you can exploit them many times. Because it is not possible both to explore and 
to exploit with any single action selection, one often refers to the “conflict” between exploration and 
exploitation. 

In any specific case, whether it is better to explore or exploit depends in a complex way on the precise 
values of the estimates, uncertainties, and the number of remaining steps. There are many sophisticated 
methods for balancing exploration and exploitation for particular mathematical formulations of the
k-armed bandit and related problems. However, most of these methods make strong assumptions about
stationarity and prior knowledge that are either violated or impossible to verify in applications and in 
the full reinforcement learning problem that we consider in subsequent chapters. The guarantees of 
optimality or bounded loss for these methods are of little comfort when the assumptions of their theory 
do not apply. 

In this book we do not worry about balancing exploration and exploitation in a sophisticated way; we 
worry only about balancing them at all. In this chapter we present several simple balancing methods for 
the k-armed bandit problem and show that they work much better than methods that always exploit.
The need to balance exploration and exploitation is a distinctive challenge that arises in reinforcement
learning; the simplicity of our version of the k-armed bandit problem enables us to show this in a
particularly clear form. 


2.2 Action-value Methods 

We begin by looking more closely at some simple methods for estimating the values of actions and for 
using the estimates to make action selection decisions. Recall that the true value of an action is the 
mean reward when that action is selected. One natural way to estimate this is by averaging the rewards 
actually received:

\[
Q_t(a) = \frac{\text{sum of rewards when $a$ taken prior to $t$}}{\text{number of times $a$ taken prior to $t$}}
       = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbf{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbf{1}_{A_i = a}},
\tag{2.1}
\]

where $\mathbf{1}_{\mathit{predicate}}$ denotes the random variable that is 1 if predicate is true and 0 if it is not. If the
denominator is zero, then we instead define $Q_t(a)$ as some default value, such as 0. As the denominator
goes to infinity, by the law of large numbers, $Q_t(a)$ converges to q*(a). We call this the sample-average
method for estimating action values because each estimate is an average of the sample of relevant 
rewards. Of course this is just one way to estimate action values, and not necessarily the best one. 
Nevertheless, for now let us stay with this simple estimation method and turn to the question of how 
the estimates might be used to select actions. 
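
As a rough illustration of equation (2.1), the following Python sketch computes a sample-average estimate directly from a recorded history. The function name `sample_average`, the list-based history, the 0-based action labels, and the example data are all assumptions made for this illustration.

```python
def sample_average(actions, rewards, a, default=0.0):
    """Sample-average estimate of the value of action `a`, in the spirit of (2.1).

    `actions[i]` and `rewards[i]` hold the action taken and the reward received
    on step i + 1; only the steps on which `a` was taken contribute.
    """
    total = 0.0
    count = 0
    for taken, reward in zip(actions, rewards):
        if taken == a:  # plays the role of the indicator 1_{A_i = a}
            total += reward
            count += 1
    # With no samples the denominator is zero, so fall back to the default value.
    return total / count if count > 0 else default


# Hypothetical history: five steps, three possible actions (0, 1, 2).
actions = [0, 1, 1, 0, 1]
rewards = [1.0, 0.0, 2.0, 2.0, 4.0]
estimates = [sample_average(actions, rewards, a) for a in range(3)]
print(estimates)  # [1.5, 2.0, 0.0]
```

Action 2 is never taken in this history, so its estimate falls back to the default value of 0.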

The simplest action selection rule is to select one of the actions with the highest estimated value, 
that is, one of the greedy actions as defined in the previous section. If there is more than one greedy 
action, then a selection is made among them in some arbitrary way, perhaps randomly. We write this 
greedy action selection method as 


\[ A_t = \operatorname*{arg\,max}_{a} Q_t(a), \tag{2.2} \]

where $\arg\max_a$ denotes the action a for which the expression that follows is maximized (again, with ties
broken arbitrarily). Greedy action selection always exploits current knowledge to maximize immediate 
reward; it spends no time at all sampling apparently inferior actions to see if they might really be better. 
A simple alternative is to behave greedily most of the time, but every once in a while, say with small
probability ε, instead select randomly from among all the actions with equal probability, independently
of the action-value estimates. We call methods using this near-greedy action selection rule ε-greedy
methods. An advantage of these methods is that, in the limit as the number of steps increases, every
action will be sampled an infinite number of times, thus ensuring that all the $Q_t(a)$ converge to q*(a).
This of course implies that the probability of selecting the optimal action converges to greater than
1 − ε, that is, to near certainty. These are just asymptotic guarantees, however, and say little about
the practical effectiveness of the methods. 
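
The ε-greedy rule described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function name `epsilon_greedy`, the use of a list of estimates indexed from 0, and the optional `rng` argument are choices made for this illustration only.

```python
import random


def epsilon_greedy(q_estimates, epsilon, rng=random):
    """Return an action index chosen by the epsilon-greedy rule."""
    if rng.random() < epsilon:
        # Explore: uniform over all actions, independent of the estimates.
        return rng.randrange(len(q_estimates))
    # Exploit: choose uniformly among the actions tied for the highest estimate.
    best = max(q_estimates)
    greedy_actions = [a for a, q in enumerate(q_estimates) if q == best]
    return rng.choice(greedy_actions)


# Example call with hypothetical estimates for three actions.
action = epsilon_greedy([0.2, 0.7, 0.7], epsilon=0.1, rng=random.Random(0))
```

Note that the exploratory branch draws from all actions, so it can still return a greedy action; setting `epsilon=0` recovers the purely greedy rule (2.2).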

Exercise 2.1 In ε-greedy action selection, for the case of two actions and ε = 0.5, what is the
probability that the greedy action is selected? 

Exercise 2.2: Bandit example Consider a k-armed bandit problem with k = 4 actions, denoted
1, 2, 3, and 4. Consider applying to this problem a bandit algorithm using ε-greedy action selection,
sample-average action-value estimates, and initial estimates of $Q_1(a) = 0$, for all a. Suppose the initial
sequence of actions and rewards is $A_1 = 1$, $R_1 = 1$, $A_2 = 2$, $R_2 = 1$, $A_3 = 2$, $R_3 = 2$, $A_4 = 2$, $R_4 = 2$,
$A_5 = 3$, $R_5 = 0$. On some of these time steps the ε case may have occurred, causing an action to be
selected at random. On which time steps did this definitely occur? On which time steps could this 
possibly have occurred? □ 


2.3 The 10-armed Testbed 

To roughly assess the relative effectiveness of the greedy and ε-greedy methods, we compared them
numerically on a suite of test problems. This was a set of 2000 randomly generated k-armed bandit
problems with k = 10. For each bandit problem, such as the one shown in Figure 2.1, the action values,
q*(a), a = 1, ..., 10, were selected according to a normal (Gaussian) distribution with mean 0 and
variance 1. Then, when a learning method applied to that problem selected action $A_t$ at time step t,
the actual reward, $R_t$, was selected from a normal distribution with mean $q_*(A_t)$ and variance 1. These
distributions are shown in gray in Figure 2.1. We call this suite of test tasks the 10-armed testbed. For
any learning method, we can measure its performance and behavior as it improves with experience over
1000 time steps when applied to one of the bandit problems. This makes up one run. Repeating this
for 2000 independent runs, each with a different bandit problem, we obtained measures of the learning
algorithm’s average behavior.

Figure 2.1: An example bandit problem from the 10-armed testbed. The true value q*(a) of each of the ten
actions was selected according to a normal distribution with mean zero and unit variance, and then the actual
rewards were selected according to a mean q*(a) unit variance normal distribution, as suggested by these gray
distributions.
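
The construction of a testbed problem described above might be sketched as follows. The class name `GaussianBandit`, its method names, and the 0-based action indices are assumptions introduced for this illustration; only the distributions (mean 0 and variance 1 for the true values, variance 1 around the true value for each reward) come from the text.

```python
import random


class GaussianBandit:
    """One randomly generated problem in the style of the 10-armed testbed."""

    def __init__(self, k=10, rng=None):
        self.rng = rng if rng is not None else random.Random()
        # True action values q*(a), drawn from a normal with mean 0 and variance 1.
        self.q_star = [self.rng.gauss(0.0, 1.0) for _ in range(k)]

    def pull(self, action):
        """Reward for `action`: normal with mean q*(action) and variance 1."""
        return self.rng.gauss(self.q_star[action], 1.0)

    def optimal_action(self):
        """Index of the action with the highest true value."""
        return max(range(len(self.q_star)), key=lambda a: self.q_star[a])


# A small suite of problems (the text uses 2000; fewer here to keep the sketch quick).
testbed = [GaussianBandit(k=10, rng=random.Random(seed)) for seed in range(20)]
```

Seeding each problem separately, as done here, is one simple way to make runs reproducible; it is a convenience of the sketch rather than something the text prescribes.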

Figure 2.2 compares a greedy method with two ε-greedy methods (ε = 0.01 and ε = 0.1), as described
above, on the 10-armed testbed. All the methods formed their action-value estimates using the sample- 
average technique. The upper graph shows the increase in expected reward with experience. The greedy 
method improved slightly faster than the other methods at the very beginning, but then leveled off at 
a lower level. It achieved a reward-per-step of only about 1, compared with the best possible of about 
1.55 on this testbed. The greedy method performed significantly worse in the long run because it often 
got stuck performing suboptimal actions. The lower graph shows that the greedy method found the 
optimal action in only approximately one-third of the tasks. In the other two-thirds, its initial samples 
of the optimal action were disappointing, and it never returned to it. The ε-greedy methods eventually
performed better because they continued to explore and to improve their chances of recognizing the
optimal action. The ε = 0.1 method explored more, and usually found the optimal action earlier, but
it never selected that action more than 91% of the time. The ε = 0.01 method improved more slowly,
but eventually would perform better than the ε = 0.1 method on both performance measures shown in
the figure. It is also possible to reduce ε over time to try to get the best of both high and low values.

Figure 2.2: Average performance of ε-greedy action-value methods on the 10-armed testbed. These data are
averages over 2000 runs with different bandit problems. All methods used sample averages as their action-value
estimates.

The advantage of ε-greedy over greedy methods depends on the task. For example, suppose the
reward variance had been larger, say 10 instead of 1. With noisier rewards it takes more exploration to
find the optimal action, and ε-greedy methods should fare even better relative to the greedy method.
On the other hand, if the reward variances were zero, then the greedy method would know the true
value of each action after trying it once. In this case the greedy method might actually perform best
because it would soon find the optimal action and then never explore. But even in the deterministic 
case there is a large advantage to exploring if we weaken some of the other assumptions. For example, 
suppose the bandit task were nonstationary, that is, the true values of the actions changed over time. 
In this case exploration is needed even in the deterministic case to make sure one of the nongreedy 
actions has not changed to become better than the greedy one. As we shall see in the next few 
chapters, nonstationarity is the case most commonly encountered in reinforcement learning. Even if 
the underlying task is stationary and deterministic, the learner faces a set of banditlike decision tasks 
each of which changes over time as learning proceeds and the agent’s policy changes. Reinforcement 
learning requires a balance between exploration and exploitation. 

Exercise 2.3 In the comparison shown in Figure 2.2, which method will perform best in the long run 
in terms of cumulative reward and probability of selecting the best action? How much better will it 
be? Express your answer quantitatively. □ 


2.4 Incremental Implementation 

The action-value methods we have discussed so far all estimate action values as sample averages of 
observed rewards. We now turn to the question of how these averages can be computed in a computationally
efficient manner, in particular, with constant memory and constant per-time-step computation.

To simplify notation we concentrate on a single action. Let $R_i$ now denote the reward received after
the ith selection of this action, and let $Q_n$ denote the estimate of its action value after it has been
selected n − 1 times, which we can now write simply as

\[ Q_n = \frac{R_1 + R_2 + \cdots + R_{n-1}}{n-1}. \]

The obvious implementation would be to maintain a record of all the rewards and then perform this 
computation whenever the estimated value was needed. However, if this is done, then the memory and 
computational requirements would grow over time as more rewards are seen. Each additional reward 
would require additional memory to store it and additional computation to compute the sum in the 
numerator. 

As you might suspect, this is not really necessary. It is easy to devise incremental formulas for 
updating averages with small, constant computation required to process each new reward. Given $Q_n$
and the nth reward, $R_n$, the new average of all n rewards can be computed by

\[
\begin{aligned}
Q_{n+1} &= \frac{1}{n}\sum_{i=1}^{n} R_i \\
        &= \frac{1}{n}\Big(R_n + \sum_{i=1}^{n-1} R_i\Big) \\
        &= \frac{1}{n}\Big(R_n + (n-1)\,Q_n\Big) \\
        &= Q_n + \frac{1}{n}\Big[R_n - Q_n\Big],
\end{aligned}
\tag{2.3}
\]

which holds even for n = 1, obtaining $Q_2 = R_1$ for arbitrary $Q_1$. This implementation requires memory
only for $Q_n$ and n, and only the small computation (2.3) for each new reward. Pseudocode for a
complete bandit algorithm using incrementally computed sample averages and ε-greedy action selection
is shown in the box below. The function bandit(a) is assumed to take an action and return
a corresponding reward.


A simple bandit algorithm

Initialize, for a = 1 to k:
    Q(a) ← 0
    N(a) ← 0

Repeat forever:
    A ← argmax_a Q(a)      with probability 1 − ε   (breaking ties randomly)
        a random action    with probability ε
    R ← bandit(A)
    N(A) ← N(A) + 1
    Q(A) ← Q(A) + (1/N(A)) [R − Q(A)]
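
The boxed algorithm can be turned into a short runnable sketch. The following Python version is one possible rendering, not a definitive implementation: the function name `run_simple_bandit`, the finite `steps` argument (standing in for "repeat forever"), the 0-based actions, and the example reward function are all assumptions of this illustration.

```python
import random


def run_simple_bandit(bandit, k, epsilon, steps, rng):
    """Epsilon-greedy action selection with incrementally computed sample averages."""
    Q = [0.0] * k  # action-value estimates Q(a)
    N = [0] * k    # number of times each action has been selected
    for _ in range(steps):  # the box repeats forever; here we stop after `steps`
        if rng.random() < epsilon:
            A = rng.randrange(k)  # a random action
        else:
            best = max(Q)
            A = rng.choice([a for a in range(k) if Q[a] == best])  # ties broken randomly
        R = bandit(A)
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]  # incremental sample-average update
    return Q, N


# Hypothetical 3-armed problem with unit-variance Gaussian rewards.
rng = random.Random(0)
true_values = [0.2, 1.0, -0.5]
Q, N = run_simple_bandit(lambda a: rng.gauss(true_values[a], 1.0),
                         k=3, epsilon=0.1, steps=5000, rng=rng)
```

Only `Q` and `N` are stored, so memory use does not grow with the number of steps, which is the point of the incremental form.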


The update rule (2.3) is of a form that occurs frequently throughout this book. The general form is

\[ \text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}\,\big[\text{Target} - \text{OldEstimate}\big]. \tag{2.4} \]

The expression [Target − OldEstimate] is an error in the estimate. It is reduced by taking a step toward
the “Target.” The target is presumed to indicate a desirable direction in which to move, though it may
be noisy. In the case above, for example, the target is the nth reward.

Note that the step-size parameter (StepSize) used in the incremental method described above changes
from time step to time step. In processing the nth reward for action a, the method uses the step-size
parameter 1/n. In this book we denote the step-size parameter by α or, more generally, by $\alpha_t(a)$. We
sometimes use the informal shorthand α = 1/n when $\alpha_t(a) = 1/n$, leaving the dependence of n on the
action implicit, just as we have in this section.
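
To make the general form (2.4) concrete, here is a small Python sketch that applies it with step size 1/n and checks that the result matches the ordinary average. The names `update` and `incremental_mean` and the sample rewards are assumptions made for this illustration.

```python
def update(old_estimate, target, step_size):
    """NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate)."""
    return old_estimate + step_size * (target - old_estimate)


def incremental_mean(rewards, initial=0.0):
    """Apply the update with step size 1/n when processing the nth reward."""
    q = initial
    for n, r in enumerate(rewards, start=1):
        q = update(q, r, 1.0 / n)
    return q


rewards = [1.0, 4.0, 2.0, 5.0]  # hypothetical rewards for a single action
batch_mean = sum(rewards) / len(rewards)
print(incremental_mean(rewards), batch_mean)
```

Because the first step size is 1, the initial estimate is overwritten by the first reward, so `incremental_mean(rewards, initial=100.0)` gives the same value up to floating-point rounding.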


2.5 Tracking a Nonstationary Problem 


The averaging methods discussed so far are appropriate for stationary bandit problems, that is, for 
bandit problems in which the reward probabilities do not change over time. As noted earlier, we often 
encounter reinforcement learning problems that are effectively nonstationary. In such cases it makes 
sense to give more weight to recent rewards than to long-past rewards. One of the most popular ways 
of doing this is to use a constant step-size parameter. For example, the incremental update rule (2.3) 
for updating an average $Q_n$ of the n − 1 past rewards is modified to be

\[ Q_{n+1} = Q_n + \alpha\big[R_n - Q_n\big], \tag{2.5} \]

where the step-size parameter α ∈ (0, 1] is constant.¹ This results in $Q_{n+1}$ being a weighted average of
past rewards and the initial estimate $Q_1$:


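\[
\begin{aligned}
Q_{n+1} &= Q_n + \alpha\big[R_n - Q_n\big] \\
        &= \alpha R_n + (1-\alpha)\,Q_n \\
        &= \alpha R_n + (1-\alpha)\big[\alpha R_{n-1} + (1-\alpha)\,Q_{n-1}\big] \\
        &= \alpha R_n + (1-\alpha)\,\alpha R_{n-1} + (1-\alpha)^2 Q_{n-1} \\
        &= \alpha R_n + (1-\alpha)\,\alpha R_{n-1} + (1-\alpha)^2 \alpha R_{n-2} + \cdots
           + (1-\alpha)^{n-1} \alpha R_1 + (1-\alpha)^n Q_1 \\
        &= (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i .
\end{aligned}
\tag{2.6}
\]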


We call this a weighted average because the sum of the weights is $(1-\alpha)^n + \sum_{i=1}^{n} \alpha(1-\alpha)^{n-i} = 1$,
as you can check for yourself. Note that the weight, $\alpha(1-\alpha)^{n-i}$, given to the reward $R_i$ depends
on how many rewards ago, n − i, it was observed. The quantity 1 − α is less than 1, and thus the
weight given to $R_i$ decreases as the number of intervening rewards increases. In fact, the weight decays
exponentially according to the exponent on 1 − α. (If 1 − α = 0, then all the weight goes on the
very last reward, $R_n$, because of the convention that $0^0 = 1$.) Accordingly, this is sometimes called an
exponential recency-weighted average. 
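
The weights in (2.6) can be checked numerically. The sketch below, with hypothetical function names `recency_weights` and `constant_step_estimate` and made-up rewards, computes the weights, confirms that they sum to 1, and compares the closed form against the step-by-step update with a constant step size.

```python
def recency_weights(alpha, n):
    """Weight on Q_1 and the weights on R_1, ..., R_n after n constant-step updates."""
    w_initial = (1 - alpha) ** n
    w_rewards = [alpha * (1 - alpha) ** (n - i) for i in range(1, n + 1)]
    return w_initial, w_rewards


def constant_step_estimate(q1, rewards, alpha):
    """Apply Q <- Q + alpha * (R - Q) once per reward, starting from q1."""
    q = q1
    for r in rewards:
        q += alpha * (r - q)
    return q


alpha, q1 = 0.1, 5.0
rewards = [1.0, 2.0, 0.0, 3.0, 1.0]  # hypothetical rewards R_1, ..., R_5
w0, w = recency_weights(alpha, len(rewards))
closed_form = w0 * q1 + sum(wi * ri for wi, ri in zip(w, rewards))
stepwise = constant_step_estimate(q1, rewards, alpha)
print(w0 + sum(w), closed_form, stepwise)
```

The weights grow with i, so more recent rewards count for more. With `alpha = 1.0`, Python evaluates `0.0 ** 0` as 1.0, matching the convention mentioned above, and all the weight falls on the last reward.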

Sometimes it is convenient to vary the step-size parameter from step to step. Let $\alpha_n(a)$ denote the
step-size parameter used to process the reward received after the nth selection of action a. As we have
noted, the choice $\alpha_n(a) = \frac{1}{n}$ results in the sample-average method, which is guaranteed to converge to
the true action values by the law of large numbers. But of course convergence is not guaranteed for all
choices of the sequence $\{\alpha_n(a)\}$. A well-known result in stochastic approximation theory gives us the
conditions required to assure convergence with probability 1:

\[ \sum_{n=1}^{\infty} \alpha_n(a) = \infty \qquad \text{and} \qquad \sum_{n=1}^{\infty} \alpha_n^2(a) < \infty. \tag{2.7} \]

The first condition is required to guarantee that the steps are large enough to eventually overcome any
initial conditions or random fluctuations. The second condition guarantees that eventually the steps
become small enough to assure convergence.

¹ The notation (a, b] as a set denotes the real interval between a and b including b but not including a. Thus, here we
are saying that 0 < α ≤ 1.

Note that both convergence conditions are met for the sample-average case, $\alpha_n(a) = \frac{1}{n}$, but not for
the case of constant step-size parameter, $\alpha_n(a) = \alpha$. In the latter case, the second condition is not
met, indicating that the estimates never completely converge but continue to vary in response to the 
most recently received rewards. As we mentioned above, this is actually desirable in a nonstationary 
environment, and problems that are effectively nonstationary are the most common in reinforcement 
learning. In addition, sequences of step-size parameters that meet the conditions (2.7) often converge 
very slowly or need considerable tuning in order to obtain a satisfactory convergence rate. Although 
sequences of step-size parameters that meet these convergence conditions are often used in theoretical 
work, they are seldom used in applications and empirical research. 
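
The claim about the two step-size choices can be verified directly against the conditions (2.7). The short worked check below relies only on standard facts about series (the divergence of the harmonic series and the convergence of the sum of inverse squares); it is a supplementary sketch in LaTeX notation.

```latex
\begin{align*}
\alpha_n(a) = \tfrac{1}{n}:
  &\quad \sum_{n=1}^{\infty} \frac{1}{n} = \infty,
  \qquad \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty
  && \text{(both conditions hold)} \\
\alpha_n(a) = \alpha \in (0, 1]:
  &\quad \sum_{n=1}^{\infty} \alpha = \infty,
  \qquad \sum_{n=1}^{\infty} \alpha^2 = \infty
  && \text{(second condition fails)}
\end{align*}
```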

Exercise 2.4 If the step-size parameters, $\alpha_n$, are not constant, then the estimate $Q_n$ is a weighted
average of previously received rewards with a weighting different from that given by (2.6). What is 
the weighting on each prior reward for the general case, analogous to (2.6), in terms of the sequence of 
step-size parameters? □ 

Exercise 2.5 (programming) Design and conduct an experiment to demonstrate the difficulties 
that sample-average methods have for nonstationary problems. Use a modified version of the 10-armed 
testbed in which all the q*(a) start out equal and then take independent random walks (say by adding
a normally distributed increment with mean zero and standard deviation 0.01 to all the q*(a) on each
step). Prepare plots like Figure 2.2 for an action-value method using sample averages, incrementally
computed, and another action-value method using a constant step-size parameter, α = 0.1. Use ε = 0.1
and longer runs, say of 10,000 steps. □ 


2.6 Optimistic Initial Values 

All the methods we have discussed so far are dependent to some extent on the initial action-value 
estimates, $Q_1(a)$. In the language of statistics, these methods are biased by their initial estimates. For
the sample-average methods, the bias disappears once all actions have been selected at least once, but
for methods with constant α, the bias is permanent, though decreasing over time as given by (2.6). In
practice, this kind of bias is usually not a problem and can sometimes be very helpful. The downside is 
that the initial estimates become, in effect, a set of parameters that must be picked by the user, if only 
to set them all to zero. The upside is that they provide an easy way to supply some prior knowledge 
about what level of rewards can be expected. 

Initial action values can also be used as a simple way to encourage exploration. Suppose that instead 
of setting the initial action values to zero, as we did in the 10-armed testbed, we set them all to +5. 
Recall that the q*(a) in this problem are selected from a normal distribution with mean 0 and variance 1.
An initial estimate of +5 is thus wildly optimistic. But this optimism encourages action-value methods 
to explore. Whichever actions are initially selected, the reward is less than the starting estimates; the 
learner switches to other actions, being “disappointed” with the rewards it is receiving. The result is 
that all actions are tried several times before the value estimates converge. The system does a fair 
amount of exploration even if greedy actions are selected all the time. 
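The effect can be seen in a small simulation. The following is a minimal sketch, not the code behind the book's figures: the function name `run_greedy`, the evenly spaced true values, the noise-free rewards, and the lowest-index tie-breaking are all assumptions made so that the behavior is easy to follow by hand.

```python
import numpy as np

def run_greedy(q_star, q_init, steps=1000, alpha=0.1, noise_std=1.0, seed=0):
    """Greedy action selection with a constant step size.

    q_star    -- true action values (hypothetical testbed values)
    q_init    -- initial estimate Q_1(a), the same for every action
    noise_std -- standard deviation of the reward noise around q_star[a]
    """
    rng = np.random.default_rng(seed)
    Q = np.full(len(q_star), float(q_init))
    actions = []
    for _ in range(steps):
        a = int(np.argmax(Q))            # greedy; ties go to the lowest index
        reward = q_star[a] + noise_std * rng.standard_normal()
        Q[a] += alpha * (reward - Q[a])  # constant-step-size update
        actions.append(a)
    return actions, Q

# Noise-free rewards (an assumption) make the effect easy to see.
q_star = np.linspace(-1.0, 1.0, 10)
optimistic, _ = run_greedy(q_star, q_init=5.0, noise_std=0.0)
realistic, _ = run_greedy(q_star, q_init=0.0, noise_std=0.0)
print("optimistic start, arms tried in first 10 steps:", sorted(set(optimistic[:10])))
print("zero start, arms ever tried:", sorted(set(realistic)))
```

With the optimistic start every arm is "disappointing" on its first pull, so the greedy learner sweeps through all ten arms before settling. With zero initial values the same learner locks onto the first arm whose reward is positive and never tries the rest.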

Figure 2.3: The effect of optimistic initial action-value estimates on the 10-armed testbed. Both methods used
a constant step-size parameter, $\alpha = 0.1$.

Figure 2.3 shows the performance on the 10-armed bandit testbed of a greedy method using $Q_1(a) = +5$,
for all $a$. For comparison, also shown is an $\varepsilon$-greedy method with $Q_1(a) = 0$. Initially, the
optimistic method performs worse because it explores more, but eventually it performs better because
its exploration decreases with time. We call this technique for encouraging exploration optimistic initial 
values. We regard it as a simple trick that can be quite effective on stationary problems, but it is far 
from being a generally useful approach to encouraging exploration. For example, it is not well suited to 
nonstationary problems because its drive for exploration is inherently temporary. If the task changes, 
creating a renewed need for exploration, this method cannot help. Indeed, any method that focuses 
on the initial conditions in any special way is unlikely to help with the general nonstationary case. 
The beginning of time occurs only once, and thus we should not focus on it too much. This criticism 
applies as well to the sample-average methods, which also treat the beginning of time as a special event, 
averaging all subsequent rewards with equal weights. Nevertheless, all of these methods are very simple, 
and one of them—or some simple combination of them—is often adequate in practice. In the rest of 
this book we make frequent use of several of these simple exploration techniques. 

Exercise 2.6: Mysterious Spikes The results shown in Figure 2.3 should be quite reliable because 
they are averages over 2000 individual, randomly chosen 10-armed bandit tasks. Why, then, are there 
oscillations and spikes in the early part of the curve for the optimistic method? In other words, what 
might make this method perform particularly better or worse, on average, on particular early steps? □ 


2.7 Upper-Confidence-Bound Action Selection 

Exploration is needed because there is always uncertainty about the accuracy of the action-value
estimates. The greedy actions are those that look best at present, but some of the other actions may
actually be better. $\varepsilon$-greedy action selection forces the non-greedy actions to be tried, but
indiscriminately, with no preference for those that are nearly greedy or particularly uncertain. It would be
better to select among the non-greedy actions according to their potential for actually being optimal,
taking into account both how close their estimates are to being maximal and the uncertainties in those
estimates. One effective way of doing this is to select actions according to


$$A_t \doteq \operatorname*{argmax}_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right], \qquad (2.8)$$

where $\ln t$ denotes the natural logarithm of $t$ (the number that $e \approx 2.71828$ would have to be raised to
in order to equal $t$), $N_t(a)$ denotes the number of times that action $a$ has been selected prior to time
$t$ (the denominator in (2.1)), and the number $c > 0$ controls the degree of exploration. If $N_t(a) = 0$,
then $a$ is considered to be a maximizing action.

The idea of this upper confidence bound (UCB) action selection is that the square-root term is a
measure of the uncertainty or variance in the estimate of $a$’s value. The quantity being maximized over is
thus a sort of upper bound on the possible true value of action $a$, with $c$ determining the confidence level.
Each time $a$ is selected the uncertainty is presumably reduced: $N_t(a)$ increments, and, as it appears in
the denominator, the uncertainty term decreases. On the other hand, each time an action other than $a$
is selected, $t$ increases but $N_t(a)$ does not; because $t$ appears in the numerator, the uncertainty estimate
increases. The use of the natural logarithm means that the increases get smaller over time, but are
unbounded; all actions will eventually be selected, but actions with lower value estimates, or that have
already been selected frequently, will be selected with decreasing frequency over time.
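Rule (2.8) translates almost directly into code. The sketch below is only illustrative: the function name `ucb_select`, the default $c = 2$, and the choice of the lowest-index untried action are assumptions rather than anything prescribed above.

```python
import numpy as np

def ucb_select(Q, N, t, c=2.0):
    """Pick an action by the UCB rule (2.8).

    Q -- current action-value estimates Q_t(a)
    N -- counts N_t(a) of how often each action was selected before time t
    t -- current time step (t >= 1)
    c -- exploration constant (the default 2.0 is an assumed value)
    """
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        # An action with N_t(a) = 0 is considered maximizing.
        return int(untried[0])
    bonus = c * np.sqrt(np.log(t) / N)  # uncertainty term
    return int(np.argmax(Q + bonus))

# A rarely tried arm with a slightly lower estimate wins on uncertainty.
choice = ucb_select(np.array([1.0, 0.9]), np.array([100, 1]), t=101)
print(choice)
```

In the usage line the second arm has been tried only once, so its bonus term is far larger than that of the first arm and it is selected despite its lower estimate.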

Results with UCB on the 10-armed testbed are shown in Figure 2.4. UCB often performs well, 
as shown here, but is more difficult than $\varepsilon$-greedy to extend beyond bandits to the more general
reinforcement learning settings considered in the rest of this book. One difficulty is in dealing with 
nonstationary problems; methods more complex than those presented in Section 2.5 would be needed. 
Another difficulty is dealing with large state spaces, particularly when using function approximation as 
developed in Part II of this book. In these more advanced settings the idea of UCB action selection is 
usually not practical. 


2.8 Gradient Bandit Algorithms 


So far in this chapter we have considered methods that estimate action values and use those estimates 
to select actions. This is often a good approach, but it is not the only one possible. In this section 
we consider learning a numerical preference for each action $a$, which we denote $H_t(a)$. The larger the
preference, the more often that action is taken, but the preference has no interpretation in terms of 
reward. Only the relative preference of one action over another is important; if we add 1000 to all the 
preferences there is no effect on the action probabilities, which are determined according to a soft-max 
distribution (i.e., Gibbs or Boltzmann distribution) as follows: 


$$\Pr\{A_t = a\} \doteq \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}} \doteq \pi_t(a), \qquad (2.9)$$

where here we have also introduced a useful new notation, $\pi_t(a)$, for the probability of taking action $a$
at time $t$. Initially all preferences are the same (e.g., $H_1(a) = 0$, for all $a$) so that all actions have an
equal probability of being selected.

Figure 2.4: Average performance of UCB action selection on the 10-armed testbed. As shown, UCB generally
performs better than $\varepsilon$-greedy action selection, except in the first $k$ steps, when it selects randomly among the
as-yet-untried actions.
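The claim that only relative preferences matter under the soft-max distribution (2.9) is easy to check numerically. The following sketch uses an assumed helper name `softmax` and arbitrary example preferences; subtracting the maximum before exponentiating is a standard numerical safeguard that relies on exactly this invariance.

```python
import numpy as np

def softmax(H):
    """Soft-max probabilities pi_t(a) from preferences H_t(a), as in (2.9)."""
    z = np.exp(H - np.max(H))  # shifting by a constant leaves the result unchanged
    return z / z.sum()

H = np.array([0.5, -1.0, 2.0])       # arbitrary example preferences
pi = softmax(H)
pi_shifted = softmax(H + 1000.0)     # add 1000 to every preference
print(pi, pi_shifted)
```

Both calls give the same probabilities, and equal preferences give the uniform distribution.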

Exercise 2.7 Show that in the case of two actions, the soft-max distribution is the same as that given 
by the logistic, or sigmoid, function often used in statistics and artificial neural networks. □ 

There is a natural learning algorithm for this setting based on the idea of stochastic gradient ascent. 
On each step, after selecting action $A_t$ and receiving the reward $R_t$, preferences are updated by:

$$\begin{aligned}
H_{t+1}(A_t) &\doteq H_t(A_t) + \alpha \bigl(R_t - \bar{R}_t\bigr)\bigl(1 - \pi_t(A_t)\bigr), && \text{and} \\
H_{t+1}(a) &\doteq H_t(a) - \alpha \bigl(R_t - \bar{R}_t\bigr)\,\pi_t(a), && \text{for all } a \neq A_t,
\end{aligned} \qquad (2.10)$$

where $\alpha > 0$ is a step-size parameter, and $\bar{R}_t \in \mathbb{R}$ is the average of all the rewards up through and
including time $t$, which can be computed incrementally as described in Section 2.4 (or Section 2.5 if
the problem is nonstationary). The $\bar{R}_t$ term serves as a baseline with which the reward is compared.
If the reward is higher than the baseline, then the probability of taking $A_t$ in the future is increased,
and if the reward is below baseline, then the probability is decreased. The non-selected actions move in the
opposite direction.
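One step of this update can be sketched as follows. The function name `gradient_bandit_step` and its argument layout are assumptions for illustration; the baseline is the incremental sample average of the rewards up through and including the current step, as described above.

```python
import numpy as np

def gradient_bandit_step(H, avg_reward, t, action, reward, alpha=0.1):
    """Apply update (2.10) once.

    H          -- preferences H_t(a), a 1-D float array
    avg_reward -- average of the rewards seen before this step
    t          -- 1-based index of the current step
    Returns the new preferences and the updated baseline.
    """
    # Baseline: average of all rewards up through and including time t.
    avg_reward = avg_reward + (reward - avg_reward) / t
    pi = np.exp(H - H.max())
    pi /= pi.sum()                      # soft-max probabilities pi_t(a)
    one_hot = np.zeros_like(H)
    one_hot[action] = 1.0
    # (one_hot - pi) is (1 - pi) for the selected action and -pi for the others.
    H_new = H + alpha * (reward - avg_reward) * (one_hot - pi)
    return H_new, avg_reward

H = np.zeros(3)
H, avg = gradient_bandit_step(H, 0.0, t=1, action=0, reward=1.0)
H, avg = gradient_bandit_step(H, avg, t=2, action=1, reward=3.0)
print(H, avg)
```

In the second step the reward 3.0 exceeds the baseline 2.0, so the preference for the selected action rises and the other two fall. The changes sum to zero, consistent with only relative preferences mattering.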

Figure 2.5 shows results with the gradient bandit algorithm on a variant of the 10-armed testbed in 
which the true expected rewards were selected according to a normal distribution with a mean of +4 
instead of zero (and with unit variance as before). This shifting up of all the rewards has absolutely 
no effect on the gradient bandit algorithm because of the reward baseline term, which instantaneously 
adapts to the new level. But if the baseline were omitted (that is, if $\bar{R}_t$ was taken to be constant zero
in (2.10)), then performance would be significantly degraded, as shown in the figure.

Figure 2.5: Average performance of the gradient bandit algorithm with and without a reward baseline on the
10-armed testbed when the $q_*(a)$ are chosen to be near +4 rather than near zero.



The Bandit Gradient Algorithm as Stochastic Gradient Ascent 


One can gain a deeper insight into the gradient bandit algorithm by understanding it as a 
stochastic approximation to gradient ascent. In exact gradient ascent, each preference $H_t(a)$
would be incremented proportional to the increment’s effect on performance:

$$H_{t+1}(a) \doteq H_t(a) + \alpha \frac{\partial \mathbb{E}[R_t]}{\partial H_t(a)}, \qquad (2.11)$$

where the measure of performance here is the expected reward:

$$\mathbb{E}[R_t] = \sum_b \pi_t(b)\, q_*(b),$$


and the measure of the increment’s effect is the partial derivative of this performance measure 
with respect to the preference. Of course, it is not possible to implement gradient ascent exactly 
in our case because by assumption we do not know the q*(b), but in fact the updates of our 
algorithm (2.10) are equal to (2.11) in expected value, making the algorithm an instance of 
stochastic gradient ascent. The calculations showing this require only beginning calculus, but 
take several steps. First we take a closer look at the exact performance gradient: 


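$$\begin{aligned}
\frac{\partial \mathbb{E}[R_t]}{\partial H_t(a)}
&= \frac{\partial}{\partial H_t(a)} \left[ \sum_b \pi_t(b)\, q_*(b) \right] \\
&= \sum_b q_*(b) \frac{\partial \pi_t(b)}{\partial H_t(a)} \\
&= \sum_b \bigl( q_*(b) - X_t \bigr) \frac{\partial \pi_t(b)}{\partial H_t(a)},
\end{aligned}$$

where $X_t$ can be any scalar that does not depend on $b$. We can include it here because the
gradient sums to zero over all the actions, $\sum_b \frac{\partial \pi_t(b)}{\partial H_t(a)} = 0$. As $H_t(a)$ is changed, some actions’
probabilities go up and some down, but the sum of the changes must be zero because the sum
of the probabilities must remain one.

$$\frac{\partial \mathbb{E}[R_t]}{\partial H_t(a)} = \sum_b \pi_t(b) \bigl( q_*(b) - X_t \bigr) \frac{\partial \pi_t(b)}{\partial H_t(a)} \Big/ \pi_t(b)$$

The equation is now in the form of an expectation, summing over all possible values $b$ of the
random variable $A_t$, then multiplying by the probability of taking those values. Thus:

$$\begin{aligned}
\frac{\partial \mathbb{E}[R_t]}{\partial H_t(a)}
&= \mathbb{E}\left[ \bigl( q_*(A_t) - X_t \bigr) \frac{\partial \pi_t(A_t)}{\partial H_t(a)} \Big/ \pi_t(A_t) \right] \\
&= \mathbb{E}\left[ \bigl( R_t - \bar{R}_t \bigr) \frac{\partial \pi_t(A_t)}{\partial H_t(a)} \Big/ \pi_t(A_t) \right],
\end{aligned}$$

where here we have chosen $X_t = \bar{R}_t$ and substituted $R_t$ for $q_*(A_t)$, which is permitted because
$\mathbb{E}[R_t \mid A_t] = q_*(A_t)$ and because the $R_t$ (given $A_t$) is uncorrelated with anything else. Shortly we
will establish that $\frac{\partial \pi_t(b)}{\partial H_t(a)} = \pi_t(b)\bigl(\mathbf{1}_{a=b} - \pi_t(a)\bigr)$, where $\mathbf{1}_{a=b}$ is defined to be 1 if $a = b$, else 0.
Assuming that for now, we have

$$\begin{aligned}
\frac{\partial \mathbb{E}[R_t]}{\partial H_t(a)}
&= \mathbb{E}\left[ \bigl( R_t - \bar{R}_t \bigr)\, \pi_t(A_t) \bigl( \mathbf{1}_{a=A_t} - \pi_t(a) \bigr) \big/ \pi_t(A_t) \right] \\
&= \mathbb{E}\left[ \bigl( R_t - \bar{R}_t \bigr) \bigl( \mathbf{1}_{a=A_t} - \pi_t(a) \bigr) \right].
\end{aligned}$$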


Recall that our plan has been to write the performance gradient as an expectation of something 
that we can sample on each step, as we have just done, and then update on each step proportional 
to the sample. Substituting a sample of the expectation above for the performance gradient in 
(2.11) yields: 


$$H_{t+1}(a) = H_t(a) + \alpha \bigl( R_t - \bar{R}_t \bigr) \bigl( \mathbf{1}_{a=A_t} - \pi_t(a) \bigr), \qquad \text{for all } a,$$

which you may recognize as being equivalent to our original algorithm (2.10).

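Thus it remains only to show that $\frac{\partial \pi_t(b)}{\partial H_t(a)} = \pi_t(b)\bigl(\mathbf{1}_{a=b} - \pi_t(a)\bigr)$, as we assumed. Recall the
standard quotient rule for derivatives:

$$\frac{\partial}{\partial x} \left[ \frac{f(x)}{g(x)} \right] = \frac{\frac{\partial f(x)}{\partial x}\, g(x) - f(x)\, \frac{\partial g(x)}{\partial x}}{g(x)^2}.$$

Using this, we can write

$$\begin{aligned}
\frac{\partial \pi_t(b)}{\partial H_t(a)}
&= \frac{\partial}{\partial H_t(a)} \pi_t(b) \\
&= \frac{\partial}{\partial H_t(a)} \left[ \frac{e^{H_t(b)}}{\sum_{c=1}^{k} e^{H_t(c)}} \right] \\
&= \frac{\frac{\partial e^{H_t(b)}}{\partial H_t(a)} \sum_{c=1}^{k} e^{H_t(c)} - e^{H_t(b)} \frac{\partial \sum_{c=1}^{k} e^{H_t(c)}}{\partial H_t(a)}}{\left( \sum_{c=1}^{k} e^{H_t(c)} \right)^2} && \text{(by the quotient rule)} \\
&= \frac{\mathbf{1}_{a=b}\, e^{H_t(b)} \sum_{c=1}^{k} e^{H_t(c)} - e^{H_t(b)} e^{H_t(a)}}{\left( \sum_{c=1}^{k} e^{H_t(c)} \right)^2} && \text{(because } \tfrac{\partial e^x}{\partial x} = e^x \text{)} \\
&= \frac{\mathbf{1}_{a=b}\, e^{H_t(b)}}{\sum_{c=1}^{k} e^{H_t(c)}} - \frac{e^{H_t(b)} e^{H_t(a)}}{\left( \sum_{c=1}^{k} e^{H_t(c)} \right)^2} \\
&= \mathbf{1}_{a=b}\, \pi_t(b) - \pi_t(b)\, \pi_t(a) \\
&= \pi_t(b) \bigl( \mathbf{1}_{a=b} - \pi_t(a) \bigr). && \text{Q.E.D.}
\end{aligned}$$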


We have just shown that the expected update of the gradient bandit algorithm is equal to the 
gradient of expected reward, and thus that the algorithm is an instance of stochastic gradient 
ascent. This assures us that the algorithm has robust convergence properties. 

Note that we did not require any properties of the reward baseline other than that it does not 
depend on the selected action. For example, we could have set it to zero, or to 1000, and the 
algorithm would still be an instance of stochastic gradient ascent. The choice of the baseline does 
not affect the expected update of the algorithm, but it does affect the variance of the update 
and thus the rate of convergence (as shown, e.g., in Figure 2.5). Choosing it as the average of 
the rewards may not be the very best, but it is simple and works well in practice. 
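Both results of this derivation can be checked numerically. The sketch below is a sanity check under assumed names (`softmax_jacobian`, `numeric_jacobian`, `expected_update`) and arbitrary example values. It compares the closed-form derivative of the soft-max with a central finite difference, and it confirms that the expected update equals the gradient of expected reward whatever action-independent baseline is used.

```python
import numpy as np

def softmax(H):
    z = np.exp(H - np.max(H))
    return z / z.sum()

def softmax_jacobian(H):
    """J[b, a] = pi(b) * (1[a == b] - pi(a)), the closed form derived above."""
    pi = softmax(H)
    return np.diag(pi) - np.outer(pi, pi)

def numeric_jacobian(H, eps=1e-6):
    """Central finite-difference estimate of d pi(b) / d H(a)."""
    k = len(H)
    J = np.zeros((k, k))
    for a in range(k):
        d = np.zeros(k)
        d[a] = eps
        J[:, a] = (softmax(H + d) - softmax(H - d)) / (2 * eps)
    return J

def expected_update(H, q, baseline):
    """E over A_t ~ pi of (q(A_t) - baseline) * (1[a == A_t] - pi(a))."""
    pi = softmax(H)
    adv = q - baseline
    return pi * adv - pi * np.dot(pi, adv)

H = np.array([0.3, -0.7, 1.2, 0.0])   # arbitrary preferences
q = np.array([1.0, 0.5, -0.2, 2.0])   # arbitrary true action values
analytic = softmax_jacobian(H)
numeric = numeric_jacobian(H)
gradient = analytic.T @ q             # d E[R_t] / d H(a) for each a
print(np.abs(analytic - numeric).max())
print(expected_update(H, q, 0.0), expected_update(H, q, 1000.0), gradient)
```

The two expected updates agree with each other and with the gradient, which illustrates why the baseline affects only the variance of the sampled update, not its expectation.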


2.9 Associative Search (Contextual Bandits) 

So far in this chapter we have considered only nonassociative tasks, that is, tasks in which there is no 
need to associate different actions with different situations. In these tasks the learner either tries to 
find a single best action when the task is stationary, or tries to track the best action as it changes over 
time when the task is nonstationary. However, in a general reinforcement learning task there is more 
than one situation, and the goal is to learn a policy: a mapping from situations to the actions that are 
best in those situations. To set the stage for the full problem, we briefly discuss the simplest way in 
which nonassociative tasks extend to the associative setting. 

As an example, suppose there are several different $k$-armed bandit tasks, and that on each step you
confront one of these chosen at random. Thus, the bandit task changes randomly from step to step.
This would appear to you as a single, nonstationary $k$-armed bandit task whose true action values
change randomly from step to step. You could try using one of the methods described in this chapter 
that can handle nonstationarity, but unless the true action values change slowly, these methods will 
not work very well. Now suppose, however, that when a bandit task is selected for you, you are given 
some distinctive clue about its identity (but not its action values). Maybe you are facing an actual 
slot machine that changes the color of its display as it changes its action values. Now you can learn a 
policy associating each task, signaled by the color you see, with the best action to take when facing that 
task—for instance, if red, select arm 1; if green, select arm 2. With the right policy you can usually 
do much better than you could in the absence of any information distinguishing one bandit task from 
another. 

This is an example of an associative search task, so called because it involves both trial-and-error 
learning to search for the best actions, and association of these actions with the situations in which they 
are best. Associative search tasks are often now called contextual bandits in the literature. Associative 
search tasks are intermediate between the $k$-armed bandit problem and the full reinforcement learning
problem. They are like the full reinforcement learning problem in that they involve learning a policy,
but like our version of the $k$-armed bandit problem in that each action affects only the immediate
reward. If actions are allowed to affect the next situation as well as the reward, then we have the 
full reinforcement learning problem. We present this problem in the next chapter and consider its 
ramifications throughout the rest of the book. 
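The simplest way to handle an associative search task is to keep a separate table of action-value estimates for each clue. The sketch below assumes $\varepsilon$-greedy selection with sample averages; the function name, the two-context example, and its true values are hypothetical and chosen only for illustration.

```python
import numpy as np

def contextual_epsilon_greedy(q_true, steps=2000, epsilon=0.1, noise_std=1.0, seed=0):
    """Learn one action-value table per context.

    q_true[c, a] is the true value of action a in context c (hypothetical values).
    Returns the learned greedy policy (one action per context) and the estimates.
    """
    rng = np.random.default_rng(seed)
    n_contexts, k = q_true.shape
    Q = np.zeros((n_contexts, k))
    N = np.zeros((n_contexts, k))
    for _ in range(steps):
        c = int(rng.integers(n_contexts))      # the clue, e.g. the display color
        if rng.random() < epsilon:
            a = int(rng.integers(k))           # explore
        else:
            a = int(np.argmax(Q[c]))           # exploit within this context
        reward = q_true[c, a] + noise_std * rng.standard_normal()
        N[c, a] += 1
        Q[c, a] += (reward - Q[c, a]) / N[c, a]  # sample-average update
    return Q.argmax(axis=1), Q

# Two contexts whose best arms differ; noise-free rewards keep the example simple.
q_true = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
policy, Q = contextual_epsilon_greedy(q_true, noise_std=0.0)
print(policy)
```

The learned policy maps each context to its own best arm, which no single context-blind choice of arm could achieve here.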

Exercise 2.8 Suppose you face a 2-armed bandit task whose true action values change randomly from 
time step to time step. Specifically, suppose that, for any time step, the true values of actions 1 and 
2 are respectively 0.1 and 0.2 with probability 0.5 (case A), and 0.9 and 0.8 with probability 0.5 (case 
B). If you are not able to tell which case you face at any step, what is the best expectation of success 
you can achieve and how should you behave to achieve it? Now suppose that on each step you are told 
whether you are facing case A or case B (although you still don’t know the true action values). This 
is an associative search task. What is the best expectation of success you can achieve in this task, and 
how should you behave to achieve it? □ 


2.10 Summary 

We have presented in this chapter several simple ways of balancing exploration and exploitation. The 
$\varepsilon$-greedy methods choose randomly a small fraction of the time, whereas UCB methods choose
deterministically but achieve exploration by subtly favoring at each step the actions that have so far received
fewer samples. Gradient bandit algorithms estimate not action values, but action preferences, and favor 
the more preferred actions in a graded, probabilistic manner using a soft-max distribution. The simple 
expedient of initializing estimates optimistically causes even greedy methods to explore significantly. 

It is natural to ask which of these methods is best. Although this is a difficult question to answer 
in general, we can certainly run them all on the 10-armed testbed that we have used throughout this 
chapter and compare their performances. A complication is that they all have a parameter; to get a 
meaningful comparison we have to consider their performance as a function of their parameter. Our 
graphs so far have shown the course of learning over time for each algorithm and parameter setting, to 
produce a learning curve for that algorithm and parameter setting. If we plotted learning curves for 
all algorithms and all parameter settings, then the graph would be too complex and crowded to make 
clear comparisons. Instead we summarize a complete learning curve by its average value over the 1000 
steps; this value is proportional to the area under the learning curve. Figure 2.6 shows this measure 
for the various bandit algorithms from this chapter, each as a function of its own parameter shown on 
a single scale on the x-axis. This kind of graph is called a parameter study. Note that the parameter 
values are varied by factors of two and presented on a log scale. Note also the characteristic inverted-U 
shapes of each algorithm’s performance; all the algorithms perform best at an intermediate value of
their parameter, neither too large nor too small. In assessing a method, we should attend not just to
how well it does at its best parameter setting, but also to how sensitive it is to its parameter value. All
of these algorithms are fairly insensitive, performing well over a range of parameter values varying by
about an order of magnitude. Overall, on this problem, UCB seems to perform best.

Figure 2.6: A parameter study of the various bandit algorithms presented in this chapter. Each point is the
average reward obtained over 1000 steps with a particular algorithm at a particular setting of its parameter.
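A parameter study of this kind takes only a few lines for a single algorithm. The sketch below covers only $\varepsilon$-greedy with sample averages on a 10-armed testbed. The function name `average_reward`, the number of runs, the seed, and the lowest-index tie-breaking are assumptions, so the numbers it prints will not match Figure 2.6 exactly.

```python
import numpy as np

def average_reward(epsilon, runs=200, steps=1000, k=10, seed=0):
    """Average reward over the first `steps` steps, averaged over `runs` tasks."""
    rng = np.random.default_rng(seed)
    q_star = rng.standard_normal((runs, k))   # one random bandit task per run
    Q = np.zeros((runs, k))
    N = np.zeros((runs, k))
    rows = np.arange(runs)
    total = 0.0
    for _ in range(steps):
        greedy = Q.argmax(axis=1)
        random_a = rng.integers(k, size=runs)
        explore = rng.random(runs) < epsilon
        a = np.where(explore, random_a, greedy)
        r = q_star[rows, a] + rng.standard_normal(runs)
        N[rows, a] += 1
        Q[rows, a] += (r - Q[rows, a]) / N[rows, a]  # sample-average update
        total += r.mean()
    return total / steps

# Parameter values varied by factors of two, as in the study described above.
epsilons = [2.0 ** -i for i in range(7, 1, -1)]   # 1/128, 1/64, ..., 1/4
study = {eps: average_reward(eps) for eps in epsilons}
for eps, value in study.items():
    print(f"epsilon = {eps:.4f}  average reward = {value:.3f}")
```

Plotting `study` against $\varepsilon$ on a log scale gives one curve of a parameter study; the other algorithms would each contribute their own curve over their own parameter.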

Despite their simplicity, in our opinion the methods presented in this chapter can fairly be considered 
the state of the art. There are more sophisticated methods, but their complexity and assumptions make 
them impractical for the full reinforcement learning problem that is our real focus. Starting in Chapter 5 
we present learning methods for solving the full reinforcement learning problem that use in part the 
simple methods explored in this chapter. 

Although the simple methods explored in this chapter may be the best we can do at present, they 
are far from a fully satisfactory solution to the problem of balancing exploration and exploitation. 

One well-studied approach to balancing exploration and exploitation in $k$-armed bandit problems is
to compute special functions called Gittins indices. These provide an optimal solution to a certain kind
of bandit problem more general than that considered here, but this approach assumes that the prior
distribution of possible problems is known. Unfortunately, neither the theory nor the computational
tractability of this method appears to generalize to the full reinforcement learning problem that we
consider in the rest of the book. 

Bayesian methods assume a known initial distribution over the action values and then update the 
distribution exactly after each step (assuming that the true action values are stationary). In general, 
the update computations can be very complex, but for certain special distributions (called conjugate
priors) they are easy. One possibility is to then select actions at each step according to their posterior
probability of being the best action. This method, sometimes called posterior sampling or Thompson
sampling, often performs similarly to the best of the distribution-free methods we have presented in
this chapter. 
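Posterior sampling is especially simple when a conjugate prior is available. The sketch below assumes rewards that are 0 or 1 (Bernoulli), which differs from the Gaussian testbed used in this chapter, because a Beta prior is then conjugate and the posterior update is just a pair of counters. The function name and the example success probabilities are hypothetical.

```python
import numpy as np

def thompson_bernoulli(p_true, steps=2000, seed=0):
    """Thompson sampling for Bernoulli rewards with Beta(1, 1) priors.

    p_true -- true success probability of each arm (hypothetical values)
    Returns the number of pulls per arm and the posterior mean of each arm.
    """
    rng = np.random.default_rng(seed)
    k = len(p_true)
    successes = np.ones(k)   # Beta parameters; Beta(1, 1) is the uniform prior
    failures = np.ones(k)
    counts = np.zeros(k, dtype=int)
    for _ in range(steps):
        samples = rng.beta(successes, failures)  # one draw per arm from its posterior
        a = int(np.argmax(samples))              # act greedily on the sampled values
        r = float(rng.random() < p_true[a])
        successes[a] += r                        # conjugate posterior update
        failures[a] += 1.0 - r
        counts[a] += 1
    return counts, successes / (successes + failures)

counts, posterior_mean = thompson_bernoulli([0.2, 0.5, 0.8])
print(counts, posterior_mean)
```

Selecting the arm whose posterior sample is largest selects each arm with its posterior probability of being best, so pulls concentrate on the best arm as the posteriors sharpen.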

In the Bayesian setting it is even conceivable to compute the optimal balance between exploration 
and exploitation. One can compute for any possible action the probability of each possible immediate 
reward and the resultant posterior distributions over action values. This evolving distribution becomes 
the information state of the problem. Given a horizon, say of 1000 steps, one can consider all possible 
actions, all possible resulting rewards, all possible next actions, all next rewards, and so on for all 1000 
steps. Given the assumptions, the rewards and probabilities of each possible chain of events can be 
determined, and one need only pick the best. But the tree of possibilities grows extremely rapidly; even
if there were only two actions and two rewards, the tree would have $2^{2000}$ leaves. It is generally not
feasible to perform this immense computation exactly, but perhaps it could be approximated efficiently. 
This approach would effectively turn the bandit problem into an instance of the full reinforcement 
learning problem. In the end, we may be able to use approximate reinforcement learning methods such 
as those presented in Part II of this book to approach this optimal solution. But that is a topic for 
research and beyond the scope of this introductory book. 

Exercise 2.9 (programming) Make a figure analogous to Figure 2.6 for the non-stationary case 
outlined in Exercise 2.5. Include the constant-step-size $\varepsilon$-greedy algorithm with $\alpha = 0.1$. Use runs of
200,000 steps and, as a performance measure for each algorithm and parameter setting, use the average 
reward over the last 100,000 steps. □ 


Bibliographical and Historical Remarks 

2.1 Bandit problems have been studied in statistics, engineering, and psychology. In statistics,
bandit problems fall under the heading “sequential design of experiments,” introduced by
Thompson (1933, 1934) and Robbins (1952), and studied by Bellman (1956). Berry and Fristedt (1985)
provide an extensive treatment of bandit problems from the perspective of statistics. Narendra 
and Thathachar (1989) treat bandit problems from the engineering perspective, providing a 
good discussion of the various theoretical traditions that have focused on them. In psychology, 
bandit problems have played roles in statistical learning theory (e.g., Bush and Mosteller, 1955; 
Estes, 1950). 

The term greedy is often used in the heuristic search literature (e.g., Pearl, 1984). The conflict 
between exploration and exploitation is known in control engineering as the conflict between 
identification (or estimation) and control (e.g., Witten, 1976). Feldbaum (1965) called it the 
dual control problem, referring to the need to solve the two problems of identification and 
control simultaneously when trying to control a system under uncertainty. In discussing aspects 
of genetic algorithms, Holland (1975) emphasized the importance of this conflict, referring to 
it as the conflict between the need to exploit and the need for new information. 

2.2 Action-value methods for our $k$-armed bandit problem were first proposed by Thathachar and
Sastry (1985). These are often called estimator algorithms in the learning automata literature.
The term action value is due to Watkins (1989). The first to use $\varepsilon$-greedy methods may also
have been Watkins (1989, p. 187), but the idea is so simple that some earlier use seems likely.

2.4–5 This material falls under the general heading of stochastic iterative algorithms, which is well
covered by Bertsekas and Tsitsiklis (1996). 

2.6 Optimistic initialization was used in reinforcement learning by Sutton (1996). 

2.7 Early work on using estimates of the upper confidence bound to select actions was done by Lai 
and Robbins (1985), Kaelbling (1993b), and Agrawal (1995). The UCB algorithm we present 
here is called UCB1 in the literature and was first developed by Auer, Cesa-Bianchi and Fischer
(2002).

2.8 Gradient bandit algorithms are a special case of the gradient-based reinforcement learning 
algorithms introduced by Williams (1992), and that later developed into the actor-critic and 
policy-gradient algorithms that we treat later in this book. Our development here was influenced 
by that of Balaraman Ravindran (personal communication). Further discussion of the choice
of baseline is provided there and by Greensmith, Bartlett, and Baxter (2001, 2004) and Dick
(2015).

The term soft-max for the action selection rule (2.9) is due to Bridle (1990). This rule appears 
to have been first proposed by Luce (1959). 

2.9 The term associative search and the corresponding problem were introduced by Barto, Sutton, 
and Brouwer (1981). The term associative reinforcement learning has also been used for
associative search (Barto and Anandan, 1985), but we prefer to reserve that term as a synonym for
the full reinforcement learning problem (as in Sutton, 1984). (And, as we noted, the modern
literature also uses the term “contextual bandits” for this problem.) We note that Thorndike’s
Law of Effect (quoted in Chapter 1) describes associative search by referring to the formation
of associative links between situations (states) and actions. According to the terminology of
operant, or instrumental, conditioning (e.g., Skinner, 1938), a discriminative stimulus is a
stimulus that signals the presence of a particular reinforcement contingency. In our terms, different
discriminative stimuli correspond to different states. 

2.10 Bellman (1956) was the first to show how dynamic programming could be used to compute 
the optimal balance between exploration and exploitation within a Bayesian formulation of 
the problem. The Gittins index approach is due to Gittins and Jones (1974). Duff (1995) 
showed how it is possible to learn Gittins indices for bandit problems through reinforcement 
learning. The survey by Kumar (1985) provides a good discussion of Bayesian and non-Bayesian 
approaches to these problems. The term information state comes from the literature on partially 
observable MDPs; see, e.g., Lovejoy (1991). 

Other theoretical research focuses on the efficiency of exploration, usually expressed as how
quickly an algorithm can approach an optimal policy. One way to formalize exploration
efficiency is by adapting to reinforcement learning the notion of sample complexity for a supervised
learning algorithm, which is the number of training examples the algorithm needs to attain a
desired degree of accuracy in learning the target function. A definition of the sample
complexity of exploration for a reinforcement learning algorithm is the number of time steps in which
the algorithm does not select near-optimal actions (Kakade, 2003). Li (2012) discusses this
and several other approaches in a survey of theoretical approaches to exploration efficiency in
reinforcement learning.


Chapter 3

Finite Markov Decision Processes


In this chapter we introduce the formal problem of finite Markov decision processes, or finite MDPs, 
which we try to solve in the rest of the book. This problem involves evaluative feedback, as in bandits, 
but also an associative aspect—choosing different actions in different situations. MDPs are a classical 
formalization of sequential decision making, where actions influence not just immediate rewards, but also 
subsequent situations, or states, and through those future rewards. Thus MDPs involve delayed reward 
and the need to trade off immediate and delayed reward. Whereas in bandit problems we estimated
the value $q_*(a)$ of each action $a$, in MDPs we estimate the value $q_*(s, a)$ of each action $a$ in each state
$s$, or we estimate the value $v_*(s)$ of each state given optimal action selections. These state-dependent
quantities are essential to accurately assigning credit for long-term consequences to individual action
selections.

MDPs are a mathematically idealized form of the reinforcement learning problem for which precise 
theoretical statements can be made. We introduce key elements of the problem’s mathematical
structure, such as returns, value functions, and Bellman equations. We try to convey the wide range of
applications that can be formulated as finite MDPs. As in all of artificial intelligence, there is a tension
between breadth of applicability and mathematical tractability. In this chapter we introduce this
tension and discuss some of the trade-offs and challenges that it implies. Some ways in which reinforcement
learning can be taken beyond MDPs are treated in Chapter 17. 


3.1 The Agent–Environment Interface

MDPs are meant to be a straightforward framing of the problem of learning from interaction to achieve 
a goal. The learner and decision maker is called the agent. The thing it interacts with, comprising 
everything outside the agent, is called the environment. These interact continually, the agent selecting 
actions and the environment responding to these actions and presenting new situations to the agent.¹
The environment also gives rise to rewards, special numerical values that the agent seeks to maximize 
over time through its choice of actions. See Figure 3.1. 

More specifically, the agent and environment interact at each of a sequence of discrete time steps, 
t = 0,1,2,3,.. .. 2 At each time step t,, the agent receives some representation of the environment’s state , 
St £ §, and on that basis selects an action , A t € A(s). 3 One time step later, in part as a consequence of 

1 We use the terms agent , environment , and action instead of the engineers’ terms controller, controlled system (or 
plant), and control signal because they are meaningful to a wider audience. 

-We restrict attention to discrete time to keep things as simple as possible, even though many of the ideas can be 
extended to the continuous-time case (e.g., see Bertsekas and Tsitsiklis, 1996; Doya, 1996). 

3 To simplify notation, we sometimes assume the special case in which the action set is the same in all states and write 
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Figure 3.1: The agent-environment interaction in a Markov decision process. 


its action, the agent receives a numerical reward , Rt+i £ A C R, and finds itself in a new state, 5t+i . 4 
The MDP and agent together thereby give rise to a sequence or trajectory that begins like this: 

So, Ao, Ri, S-\ , A-[ , R .'2 , S 2 , A 2 , R 3 ,... (3-1) 

In a finite MDP, the sets of states, actions, and rewards ($\mathcal{S}$, $\mathcal{A}$, and $\mathcal{R}$) all have a finite number of
elements. In this case, the random variables $R_t$ and $S_t$ have well defined discrete probability distributions
dependent only on the preceding state and action. That is, for particular values of these random
variables, $s' \in \mathcal{S}$ and $r \in \mathcal{R}$, there is a probability of those values occurring at time $t$, given particular
values of the preceding state and action:

$$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}, \tag{3.2}$$

for all $s', s \in \mathcal{S}$, $r \in \mathcal{R}$, and $a \in \mathcal{A}(s)$. The dot over the equals sign in this equation reminds us that it
is a definition (in this case of the function $p$) rather than a fact that follows from previous definitions.
The function $p : \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \to [0, 1]$ is an ordinary deterministic function of four arguments. The ‘|’
in the middle of it comes from the notation for conditional probability, but here it just reminds us that
$p$ specifies a probability distribution for each choice of $s$ and $a$, that is, that

$$\sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1, \quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A}(s). \tag{3.3}$$

The probabilities given by the four-argument function $p$ completely characterize the dynamics of a
finite MDP. From it, one can compute anything else one might want to know about the environment,
such as the state-transition probabilities (which we denote, with a slight abuse of notation, as a
three-argument function $p : \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to [0, 1]$),

$$p(s' \mid s, a) \doteq \Pr\{S_t = s' \mid S_{t-1} = s, A_{t-1} = a\} = \sum_{r \in \mathcal{R}} p(s', r \mid s, a). \tag{3.4}$$

We can also compute the expected rewards for state-action pairs as a two-argument function
$r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$:

$$r(s, a) \doteq \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a), \tag{3.5}$$

or the expected rewards for state-action-next-state triples as a three-argument function
$r : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$,

$$r(s, a, s') \doteq \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a, S_t = s'] = \sum_{r \in \mathcal{R}} r \, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}. \tag{3.6}$$


it simply as A. 

4 We use $R_{t+1}$ instead of $R_t$ to denote the reward due to $A_t$ because it emphasizes that the next reward and next
state, $R_{t+1}$ and $S_{t+1}$, are jointly determined. Unfortunately, both conventions are widely used in the literature.

In this book, we usually use the four-argument $p$ function (3.2), but each of these other notations is
also occasionally convenient.
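
As a concrete illustration of (3.3) through (3.6), the following Python sketch stores a four-argument $p$ as a nested dictionary and derives the other quantities from it. The toy states, the single action, and all probabilities are made up for illustration and do not come from the text; the dictionary layout is only one convenient assumption.

```python
from collections import defaultdict

# Hypothetical toy dynamics: p4[(s, a)] maps (s_next, r) -> p(s_next, r | s, a).
p4 = {
    ("s0", "go"): {("s1", 1.0): 0.7, ("s0", 0.0): 0.2, ("s1", 5.0): 0.1},
    ("s1", "go"): {("s0", 0.0): 1.0},
}

def check_normalized(p4, tol=1e-12):
    """Equation (3.3): for each (s, a), probabilities over (s', r) sum to 1."""
    return all(abs(sum(dist.values()) - 1.0) <= tol for dist in p4.values())

def state_transition(p4, s, a):
    """Equation (3.4): p(s' | s, a) = sum over r of p(s', r | s, a)."""
    out = defaultdict(float)
    for (s_next, _r), prob in p4[(s, a)].items():
        out[s_next] += prob
    return dict(out)

def expected_reward(p4, s, a):
    """Equation (3.5): r(s, a) = sum over r and s' of r * p(s', r | s, a)."""
    return sum(r * prob for (_s_next, r), prob in p4[(s, a)].items())

def expected_reward_given_next(p4, s, a, s_next):
    """Equation (3.6); assumes p(s' | s, a) > 0 for the requested s'."""
    p_next = state_transition(p4, s, a)[s_next]
    joint = sum(r * prob for (sn, r), prob in p4[(s, a)].items() if sn == s_next)
    return joint / p_next

print(check_normalized(p4))
print(state_transition(p4, "s0", "go"))
print(expected_reward(p4, "s0", "go"))
print(expected_reward_given_next(p4, "s0", "go", "s1"))
```

Storing only the four-argument function and deriving the rest mirrors the statement that $p$ completely characterizes the dynamics.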

The MDP framework is abstract and flexible and can be applied to many different problems in many 
different ways. For example, the time steps need not refer to fixed intervals of real time; they can refer 
to arbitrary successive stages of decision making and acting. The actions can be low-level controls, such 
as the voltages applied to the motors of a robot arm, or high-level decisions, such as whether or not to 
have lunch or to go to graduate school. Similarly, the states can take a wide variety of forms. They can 
be completely determined by low-level sensations, such as direct sensor readings, or they can be more 
high-level and abstract, such as symbolic descriptions of objects in a room. Some of what makes up a 
state could be based on memory of past sensations or even be entirely mental or subjective. For example, 
an agent could be in the state of not being sure where an object is, or of having just been surprised 
in some clearly defined sense. Similarly, some actions might be totally mental or computational. For 
example, some actions might control what an agent chooses to think about, or where it focuses its 
attention. In general, actions can be any decisions we want to learn how to make, and the states can 
be anything we can know that might be useful in making them. 

In particular, the boundary between agent and environment is typically not the same as the physical
boundary of a robot’s or animal’s body. Usually, the boundary is drawn closer to the agent than that.
For example, the motors and mechanical linkages of a robot and its sensing hardware should usually 
be considered parts of the environment rather than parts of the agent. Similarly, if we apply the MDP 
framework to a person or animal, the muscles, skeleton, and sensory organs should be considered part 
of the environment. Rewards, too, presumably are computed inside the physical bodies of natural and 
artificial learning systems, but are considered external to the agent. 

The general rule we follow is that anything that cannot be changed arbitrarily by the agent is 
considered to be outside of it and thus part of its environment. We do not assume that everything in 
the environment is unknown to the agent. For example, the agent often knows quite a bit about how 
its rewards are computed as a function of its actions and the states in which they are taken. But we 
always consider the reward computation to be external to the agent because it defines the task facing 
the agent and thus must be beyond its ability to change arbitrarily. In fact, in some cases the agent may 
know everything about how its environment works and still face a difficult reinforcement learning task, 
just as we may know exactly how a puzzle like Rubik’s cube works, but still be unable to solve it. The 
agent-environment boundary represents the limit of the agent’s absolute control, not of its knowledge.

The agent-environment boundary can be located at different places for different purposes. In a 
complicated robot, many different agents may be operating at once, each with its own boundary. For 
example, one agent may make high-level decisions which form part of the states faced by a lower- 
level agent that implements the high-level decisions. In practice, the agent-environment boundary is 
determined once one has selected particular states, actions, and rewards, and thus has identified a 
specific decision making task of interest. 

The MDP framework is a considerable abstraction of the problem of goal-directed learning from 
interaction. It proposes that whatever the details of the sensory, memory, and control apparatus, and 
whatever objective one is trying to achieve, any problem of learning goal-directed behavior can be 
reduced to three signals passing back and forth between an agent and its environment: one signal to 
represent the choices made by the agent (the actions), one signal to represent the basis on which the 
choices are made (the states), and one signal to define the agent’s goal (the rewards). This framework 
may not be sufficient to represent all decision-learning problems usefully, but it has proved to be widely 
useful and applicable. 

Of course, the particular states and actions vary greatly from task to task, and how they are represented
can strongly affect performance. In reinforcement learning, as in other kinds of learning, such
representational choices are at present more art than science. In this book we offer some advice and
examples regarding good ways of representing states and actions, but our primary focus is on general
principles for learning how to behave once the representations have been selected.

Example 3.1: Bioreactor Suppose reinforcement learning is being applied to determine moment-by-moment
temperatures and stirring rates for a bioreactor (a large vat of nutrients and bacteria used to
produce useful chemicals). The actions in such an application might be target temperatures and target 
stirring rates that are passed to lower-level control systems that, in turn, directly activate heating 
elements and motors to attain the targets. The states are likely to be thermocouple and other sensory 
readings, perhaps filtered and delayed, plus symbolic inputs representing the ingredients in the vat 
and the target chemical. The rewards might be moment-by-moment measures of the rate at which 
the useful chemical is produced by the bioreactor. Notice that here each state is a list, or vector, of 
sensor readings and symbolic inputs, and each action is a vector consisting of a target temperature 
and a stirring rate. It is typical of reinforcement learning tasks to have states and actions with such 
structured representations. Rewards, on the other hand, are always single numbers. ■ 

Example 3.2: Pick-and-Place Robot Consider using reinforcement learning to control the motion 
of a robot arm in a repetitive pick-and-place task. If we want to learn movements that are fast and 
smooth, the learning agent will have to control the motors directly and have low-latency information 
about the current positions and velocities of the mechanical linkages. The actions in this case might 
be the voltages applied to each motor at each joint, and the states might be the latest readings of joint 
angles and velocities. The reward might be +1 for each object successfully picked up and placed. To 
encourage smooth movements, on each time step a small, negative reward can be given as a function of 
the moment-to-moment “jerkiness” of the motion. ■ 

Example 3.3: Recycling Robot A mobile robot has the job of collecting empty soda cans in 
an office environment. It has sensors for detecting cans, and an arm and gripper that can pick them 
up and place them in an onboard bin; it runs on a rechargeable battery. The robot’s control system 
has components for interpreting sensory information, for navigating, and for controlling the arm and 
gripper. High-level decisions about how to search for cans are made by a reinforcement learning agent 
based on the current charge level of the battery. This agent has to decide whether the robot should (1) 
actively search for a can for a certain period of time, (2) remain stationary and wait for someone to 
bring it a can, or (3) head back to its home base to recharge its battery. This decision has to be made 
either periodically or whenever certain events occur, such as finding an empty can. The agent therefore 
has three actions, and the state is primarily determined by the state of the battery. The rewards might 
be zero most of the time, but then become positive when the robot secures an empty can, or large and 
negative if the battery runs all the way down. In this example, the reinforcement learning agent is not 
the entire robot. The states it monitors describe conditions within the robot itself, not conditions of the 
robot’s external environment. The agent’s environment therefore includes the rest of the robot, which 
might contain other complex decision-making systems, as well as the robot’s external environment. 


Exercise 3.1 Devise three example tasks of your own that fit into the MDP framework, identifying for 
each its states, actions, and rewards. Make the three examples as different from each other as possible. 
The framework is abstract and flexible and can be applied in many different ways. Stretch its limits in 
some way in at least one of your examples. □ 

Exercise 3.2 Is the MDP framework adequate to usefully represent all goal-directed learning tasks? 
Can you think of any clear exceptions? □ 

Exercise 3.3 Consider the problem of driving. You could define the actions in terms of the accelerator, 
steering wheel, and brake, that is, where your body meets the machine. Or you could define them farther 
out—say, where the rubber meets the road, considering your actions to be tire torques. Or you could 
define them farther in—say, where your brain meets your body, the actions being muscle twitches to 
control your limbs. Or you could go to a really high level and say that your actions are your choices of 
where to drive. What is the right level, the right place to draw the line between agent and environment?
On what basis is one location of the line to be preferred over another? Is there any fundamental reason
for preferring one location over another, or is it a free choice? □ 

Exercise 3.4 If the current state is $S_t$, and actions are selected according to stochastic policy $\pi$, then
what is the expectation of $R_{t+1}$ in terms of the four-argument function $p$ (3.2)? □

Example 3.4: Recycling Robot MDP The recycling robot (Example 3.3) can be turned into a 
simple example of an MDP by simplifying it and providing some more details. (Our aim is to produce 
a simple example, not a particularly realistic one.) Recall that the agent makes a decision at times 
determined by external events (or by other parts of the robot’s control system). At each such time the 
robot decides whether it should (1) actively search for a can, (2) remain stationary and wait for someone 
to bring it a can, or (3) go back to home base to recharge its battery. Suppose the environment works as 
follows. The best way to find cans is to actively search for them, but this runs down the robot’s battery, 
whereas waiting does not. Whenever the robot is searching, the possibility exists that its battery will 
become depleted. In this case the robot must shut down and wait to be rescued (producing a low 
reward). 

The agent makes its decisions solely as a function of the energy level of the battery. It can distinguish
two levels, high and low, so that the state set is $\mathcal{S} = \{\texttt{high}, \texttt{low}\}$. Let us call the possible decisions (the
agent’s actions) wait, search, and recharge. When the energy level is high, recharging would always
be foolish, so we do not include it in the action set for this state. The agent’s action sets are

$$\begin{aligned}
\mathcal{A}(\texttt{high}) &\doteq \{\texttt{search}, \texttt{wait}\} \\
\mathcal{A}(\texttt{low}) &\doteq \{\texttt{search}, \texttt{wait}, \texttt{recharge}\}.
\end{aligned}$$


If the energy level is high, then a period of active search can always be completed without risk of
depleting the battery. A period of searching that begins with a high energy level leaves the energy level
high with probability $\alpha$ and reduces it to low with probability $1 - \alpha$. On the other hand, a period
of searching undertaken when the energy level is low leaves it low with probability $\beta$ and depletes
the battery with probability $1 - \beta$. In the latter case, the robot must be rescued, and the battery
is then recharged back to high. Each can collected by the robot counts as a unit reward, whereas a
reward of $-3$ results whenever the robot has to be rescued. Let $r_{\text{search}}$ and $r_{\text{wait}}$, with $r_{\text{search}} > r_{\text{wait}}$,
respectively denote the expected number of cans the robot will collect (and hence the expected reward)
while searching and while waiting. Finally, to keep things simple, suppose that no cans can be collected
during a run home for recharging, and that no cans can be collected on a step in which the battery is
depleted. This system is then a finite MDP, and we can write down the transition probabilities and the
expected rewards, as in Table 3.1.

A transition graph is a useful way to summarize the dynamics of a finite MDP. Figure 3.2 shows the 
transition graph for the recycling robot example. There are two kinds of nodes: state nodes and action 
nodes. There is a state node for each possible state (a large open circle labeled by the name of the 
state), and an action node for each state-action pair (a small solid circle labeled by the name of the 
action and connected by a line to the state node). Starting in state $s$ and taking action $a$ moves you
along the line from state node $s$ to action node $(s, a)$. Then the environment responds with a transition
to the next state’s node via one of the arrows leaving action node $(s, a)$. Each arrow corresponds to
a triple $(s, s', a)$, where $s'$ is the next state, and we label the arrow with the transition probability,
$p(s' \mid s, a)$, and the expected reward for that transition, $r(s, a, s')$. Note that the transition probabilities
labeling the arrows leaving an action node always sum to 1. ■

Exercise 3.5 Give a table analogous to Table 3.1, but for $p(s', r \mid s, a)$. It should have columns for
$s$, $a$, $s'$, $r$, and $p(s', r \mid s, a)$, and a row for every 4-tuple for which $p(s', r \mid s, a) > 0$. □





s       a          s'      p(s' | s, a)     r(s, a, s')
high    search     high    $\alpha$         $r_{\text{search}}$
high    search     low     $1 - \alpha$     $r_{\text{search}}$
low     search     high    $1 - \beta$      $-3$
low     search     low     $\beta$          $r_{\text{search}}$
high    wait       high    1                $r_{\text{wait}}$
high    wait       low     0                $r_{\text{wait}}$
low     wait       high    0                $r_{\text{wait}}$
low     wait       low     1                $r_{\text{wait}}$
low     recharge   high    1                0
low     recharge   low     0                0

Table 3.1: Transition probabilities and expected rewards for the finite MDP of the recycling robot example.
There is a row for each possible combination of current state, $s$, next state, $s'$, and action possible in the current
state, $a \in \mathcal{A}(s)$.
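
The table can also be written down as a small data structure. The Python sketch below is one possible encoding, not the book's; the numeric values passed for $\alpha$, $\beta$, $r_{\text{search}}$, and $r_{\text{wait}}$ are arbitrary placeholders, since the text leaves them symbolic. It checks that the transition probabilities for each state-action pair sum to 1 and computes $r(s, a)$ as $\sum_{s'} p(s' \mid s, a)\, r(s, a, s')$, which follows from (3.5) and (3.6).

```python
def recycling_robot(alpha, beta, r_search, r_wait):
    """Rows of Table 3.1 as {(s, a): {s_next: (probability, expected_reward)}}.

    Rows with probability zero are omitted.
    """
    return {
        ("high", "search"): {"high": (alpha, r_search), "low": (1 - alpha, r_search)},
        ("low", "search"): {"high": (1 - beta, -3.0), "low": (beta, r_search)},
        ("high", "wait"): {"high": (1.0, r_wait)},
        ("low", "wait"): {"low": (1.0, r_wait)},
        ("low", "recharge"): {"high": (1.0, 0.0)},
    }

def action_set(mdp, s):
    """A(s): the actions that appear with state s in the table."""
    return sorted(a for (state, a) in mdp if state == s)

def expected_reward(mdp, s, a):
    """r(s, a) = sum over s' of p(s' | s, a) * r(s, a, s')."""
    return sum(p * r for p, r in mdp[(s, a)].values())

# Placeholder parameter values, chosen only so the arithmetic is easy to follow.
mdp = recycling_robot(alpha=0.75, beta=0.5, r_search=2.0, r_wait=1.0)

for (s, a), row in mdp.items():
    total = sum(p for p, _ in row.values())
    print(s, a, "sum of probabilities:", total, "r(s, a):", expected_reward(mdp, s, a))
```

With these placeholder values, searching from the low state has a negative expected reward because the rescue penalty outweighs the expected cans, which is the kind of trade-off the example is built to expose.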



Figure 3.2: Transition graph for the recycling robot example. 


3.2 Goals and Rewards 

In reinforcement learning, the purpose or goal of the agent is formalized in terms of a special signal, 
called the reward, passing from the environment to the agent. At each time step, the reward is a simple 
number, $R_t \in \mathbb{R}$. Informally, the agent’s goal is to maximize the total amount of reward it receives.
This means maximizing not immediate reward, but cumulative reward in the long run. We can clearly 
state this informal idea as the reward hypothesis: 

That all of what we mean by goals and purposes can be well thought of as the maximization 
of the expected value of the cumulative sum of a received scalar signal (called reward). 

The use of a reward signal to formalize the idea of a goal is one of the most distinctive features of 
reinforcement learning. 

Although formulating goals in terms of reward signals might at first appear limiting, in practice it 
has proved to be flexible and widely applicable. The best way to see this is to consider examples of how 
it has been, or could be, used. For example, to make a robot learn to walk, researchers have provided 
reward on each time step proportional to the robot’s forward motion. In making a robot learn how 
to escape from a maze, the reward is often $-1$ for every time step that passes prior to escape; this
encourages the agent to escape as quickly as possible. To make a robot learn to find and collect empty
soda cans for recycling, one might give it a reward of zero most of the time, and then a reward of +1 for
each can collected. One might also want to give the robot negative rewards when it bumps into things
or when somebody yells at it. For an agent to learn to play checkers or chess, the natural rewards are
+1 for winning, $-1$ for losing, and 0 for drawing and for all nonterminal positions.

You can see what is happening in all of these examples. The agent always learns to maximize its 
reward. If we want it to do something for us, we must provide rewards to it in such a way that in 
maximizing them the agent will also achieve our goals. It is thus critical that the rewards we set up 
truly indicate what we want accomplished. In particular, the reward signal is not the place to impart
to the agent prior knowledge about how to achieve what we want it to do.$^5$ For example, a chess-playing
agent should be rewarded only for actually winning, not for achieving subgoals such as taking
its opponent’s pieces or gaining control of the center of the board. If achieving these sorts of subgoals
were rewarded, then the agent might find a way to achieve them without achieving the real goal. For
example, it might find a way to take the opponent’s pieces even at the cost of losing the game. The
reward signal is your way of communicating to the robot what you want it to achieve, not how you
want it achieved.$^6$


3.3 Returns and Episodes 

So far we have discussed the objective of learning informally. We have said that the agent’s goal is to 
maximize the cumulative reward it receives in the long run. How might this be defined formally? If the 
sequence of rewards received after time step $t$ is denoted $R_{t+1}, R_{t+2}, R_{t+3}, \ldots$, then what precise aspect
of this sequence do we wish to maximize? In general, we seek to maximize the expected return, where
the return, denoted $G_t$, is defined as some specific function of the reward sequence. In the simplest case
the return is the sum of the rewards:

$$G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T, \tag{3.7}$$

where $T$ is a final time step. This approach makes sense in applications in which there is a natural notion
of final time step, that is, when the agent-environment interaction breaks naturally into subsequences,
which we call episodes,$^7$ such as plays of a game, trips through a maze, or any sort of repeated interaction.
Each episode ends in a special state called the terminal state, followed by a reset to a standard starting
state or to a sample from a standard distribution of starting states. Even if you think of episodes as
ending in different ways, such as winning and losing a game, the next episode begins independently of
how the previous one ended. Thus the episodes can all be considered to end in the same terminal state,
with different rewards for the different outcomes. Tasks with episodes of this kind are called episodic
tasks. In episodic tasks we sometimes need to distinguish the set of all nonterminal states, denoted $\mathcal{S}$,
from the set of all states plus the terminal state, denoted $\mathcal{S}^+$. The time of termination, $T$, is a random
variable that normally varies from episode to episode.

On the other hand, in many cases the agent-environment interaction does not break naturally into 
identifiable episodes, but goes on continually without limit. For example, this would be the natural way 
to formulate an on-going process-control task, or an application to a robot with a long life span. We 
call these continuing tasks. The return formulation (3.7) is problematic for continuing tasks because 
the final time step would be $T = \infty$, and the return, which is what we are trying to maximize, could
itself easily be infinite. (For example, suppose the agent receives a reward of +1 at each time step.) 
Thus, in this book we usually use a definition of return that is slightly more complex conceptually but 
much simpler mathematically. 

5 Better places for imparting this kind of prior knowledge are the initial policy or initial value function, or in influences 
on these. 

6 Section 17.4 delves further into the issue of designing effective reward signals. 

7 Episodes are sometimes called “trials” in the literature.

The additional concept that we need is that of discounting. According to this approach, the agent
tries to select actions so that the sum of the discounted rewards it receives over the future is maximized.
In particular, it chooses $A_t$ to maximize the expected discounted return:

$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \tag{3.8}$$

where $\gamma$ is a parameter, $0 \le \gamma \le 1$, called the discount rate.

The discount rate determines the present value of future rewards: a reward received $k$ time steps in
the future is worth only $\gamma^{k-1}$ times what it would be worth if it were received immediately. If $\gamma < 1$,
the infinite sum in (3.8) has a finite value as long as the reward sequence $\{R_k\}$ is bounded. If $\gamma = 0$,
the agent is “myopic” in being concerned only with maximizing immediate rewards: its objective in this
case is to learn how to choose $A_t$ so as to maximize only $R_{t+1}$. If each of the agent’s actions happened
to influence only the immediate reward, not future rewards as well, then a myopic agent could maximize
(3.8) by separately maximizing each immediate reward. But in general, acting to maximize immediate
reward can reduce access to future rewards so that the return is reduced. As $\gamma$ approaches 1, the return
objective takes future rewards into account more strongly; the agent becomes more farsighted.
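
The effect of the discount rate can be seen in a few lines of Python. The sketch below evaluates (3.8) for finite reward lists; the two reward sequences are invented for illustration and stand for two hypothetical courses of action, one paying a small reward immediately and one paying a larger reward two steps later.

```python
def discounted_return(rewards, gamma):
    """Equation (3.8) for a finite list rewards = [R_{t+1}, R_{t+2}, ...]."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Two hypothetical reward sequences available from the same starting point.
grab_now = [2.0, 0.0, 0.0]       # small reward immediately
wait_for_it = [0.0, 0.0, 10.0]   # larger reward, received as R_{t+3}

for gamma in (0.0, 0.5, 0.9):
    g_now = discounted_return(grab_now, gamma)
    g_wait = discounted_return(wait_for_it, gamma)
    better = "grab_now" if g_now > g_wait else "wait_for_it"
    print(f"gamma={gamma}: grab_now={g_now:.2f}, wait_for_it={g_wait:.2f}, better={better}")
```

With $\gamma = 0$ only the immediate reward counts, so the myopic choice wins; for the larger values of $\gamma$ tried here the delayed reward, weighted by $\gamma^2$, dominates.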

Example 3.5: Pole-Balancing The objective in this task is to apply forces to a cart moving along 
a track so as to keep a pole hinged to the cart from falling over: A failure is said to occur if the pole 



falls past a given angle from vertical or if the cart runs off the track. The pole is reset to vertical 
after each failure. This task could be treated as episodic, where the natural episodes are the repeated 
attempts to balance the pole. The reward in this case could be +1 for every time step on which failure 
did not occur, so that the return at each time would be the number of steps until failure. In this case, 
successful balancing forever would mean a return of infinity. Alternatively, we could treat pole-balancing 
as a continuing task, using discounting. In this case the reward would be $-1$ on each failure and zero
at all other times. The return at each time would then be related to $-\gamma^K$, where $K$ is the number of
time steps before failure. In either case, the return is maximized by keeping the pole balanced for as 
long as possible. ■ 

Exercise 3.6 The equations in Section 3.1 are for the continuing case and need to be modified (very 
slightly) to apply to episodic tasks. Show that you know the modifications needed by giving the modified 
version of (3.3). □ 

Exercise 3.7 Suppose you treated pole-balancing as an episodic task but also used discounting, with 
all rewards zero except for $-1$ upon failure. What then would the return be at each time? How does
this return differ from that in the discounted, continuing formulation of this task? □ 

Exercise 3.8 Imagine that you are designing a robot to run a maze. You decide to give it a reward of 
+1 for escaping from the maze and a reward of zero at all other times. The task seems to break down 
naturally into episodes—the successive runs through the maze—so you decide to treat it as an episodic 
task, where the goal is to maximize expected total reward (3.7). After running the learning agent for
a while, you find that it is showing no improvement in escaping from the maze. What is going wrong?
Have you effectively communicated to the agent what you want it to achieve? □ 

Returns at successive time steps are related to each other in a way that is important for the theory 
and algorithms of reinforcement learning: 


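$$\begin{aligned}
G_t &\doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots \\
&= R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots \right) \\
&= R_{t+1} + \gamma G_{t+1}
\end{aligned} \tag{3.9}$$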

Note that this works for all time steps $t < T$, even if termination occurs at $t + 1$, if we define $G_T = 0$.
This often makes it easy to compute returns from reward sequences. 
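
A minimal Python sketch of that computation follows, assuming the rewards of one episode are stored in a list with `rewards[k]` holding $R_{k+1}$ (an indexing convention chosen here, not taken from the text). It starts from $G_T = 0$ and applies (3.9) backwards; the example rewards are arbitrary.

```python
def returns_from_rewards(rewards, gamma):
    """Return [G_0, G_1, ..., G_T] for one episode, using G_t = R_{t+1} + gamma * G_{t+1}.

    rewards[k] holds R_{k+1}, so T = len(rewards) and G_T = 0.
    """
    T = len(rewards)
    G = [0.0] * (T + 1)
    for t in range(T - 1, -1, -1):   # t = T-1, T-2, ..., 0
        G[t] = rewards[t] + gamma * G[t + 1]
    return G

def return_by_direct_sum(rewards, gamma, t):
    """The same G_t computed directly as a discounted sum, for cross-checking."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards[t:]))

example_rewards = [1.0, 2.0, 4.0]    # R_1, R_2, R_3 (arbitrary values), so T = 3
G = returns_from_rewards(example_rewards, gamma=0.5)
print(G)
print([return_by_direct_sum(example_rewards, 0.5, t) for t in range(4)])
```

The backward pass does one multiplication and one addition per time step, whereas recomputing each discounted sum from scratch repeats work that the recursion shares.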

Exercise 3.9 Suppose $\gamma = 0.5$ and the following sequence of rewards is received: $R_1 = -1$, $R_2 = 2$,
$R_3 = 6$, $R_4 = 3$, and $R_5 = 2$, with $T = 5$. What are $G_0, G_1, \ldots, G_5$? Hint: Work backwards. □

Note that although the return (3.8) is a sum of an infinite number of terms, it is still finite if the
reward is nonzero and constant, provided that $\gamma < 1$. For example, if the reward is a constant +1, then the return
is

$$G_t = \sum_{k=0}^{\infty} \gamma^k = \frac{1}{1 - \gamma}. \tag{3.10}$$


Exercise 3.10 Suppose $\gamma = 0.9$ and the reward sequence is $R_1 = 2$ followed by an infinite sequence of
7s. What are $G_1$ and $G_0$? □

Exercise 3.11 Prove (3.10). □ 


3.4 Unified Notation for Episodic and Continuing Tasks 

In the preceding section we described two kinds of reinforcement learning tasks, one in which the agent- 
environment interaction naturally breaks down into a sequence of separate episodes (episodic tasks), 
and one in which it does not (continuing tasks). The former case is mathematically easier because each 
action affects only the finite number of rewards subsequently received during the episode. In this book 
we consider sometimes one kind of problem and sometimes the other, but often both. It is therefore 
useful to establish one notation that enables us to talk precisely about both cases simultaneously. 

To be precise about episodic tasks requires some additional notation. Rather than one long sequence 
of time steps, we need to consider a series of episodes, each of which consists of a finite sequence of 
time steps. We number the time steps of each episode starting anew from zero. Therefore, we have to 
refer not just to $S_t$, the state representation at time $t$, but to $S_{t,i}$, the state representation at time $t$ of
episode $i$ (and similarly for $A_{t,i}$, $R_{t,i}$, $\pi_{t,i}$, $T_i$, etc.). However, it turns out that when we discuss episodic
tasks we almost never have to distinguish between different episodes. We are almost always considering
a particular single episode, or stating something that is true for all episodes. Accordingly, in practice
we almost always abuse notation slightly by dropping the explicit reference to episode number. That
is, we write $S_t$ to refer to $S_{t,i}$, and so on.

We need one other convention to obtain a single notation that covers both episodic and continuing 
tasks. We have defined the return as a sum over a finite number of terms in one case (3.7) and as a 
sum over an infinite number of terms in the other (3.8). These can be unified by considering episode 
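termination to be the entering of a special absorbing state that transitions only to itself and that
generates only rewards of zero. For example, consider the state transition diagram:

[State transition diagram: $S_0 \to S_1 \to S_2 \to$ absorbing state (solid square), with rewards $R_1 = +1$, $R_2 = +1$, $R_3 = +1$ on the first three transitions and $R_4 = 0$, $R_5 = 0$, ... on the absorbing state's self-transition.]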


Here the solid square represents the special absorbing state corresponding to the end of an episode. 
Starting from $S_0$, we get the reward sequence $+1, +1, +1, 0, 0, 0, \ldots$. Summing these, we get the same
return whether we sum over the first $T$ rewards (here $T = 3$) or over the full infinite sequence. This
remains true even if we introduce discounting. Thus, we can define the return, in general, according to
(3.8), using the convention of omitting episode numbers when they are not needed, and including the
possibility that $\gamma = 1$ if the sum remains defined (e.g., because all episodes terminate). Alternatively,
we can also write the return as

$$G_t \doteq \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k, \tag{3.11}$$

including the possibility that $T = \infty$ or $\gamma = 1$ (but not both). We use these conventions throughout the
rest of the book to simplify notation and to express the close parallels between episodic and continuing 
tasks. (Later, in Chapter 10, we will introduce a formulation that is both continuing and undiscounted.) 
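
The equivalence of the two ways of writing the return can be checked numerically. In the Python sketch below, `episodic_return` implements (3.11) over the $T$ rewards of an episode, while `absorbing_return` pads the same rewards with zeros, as if an absorbing state had been entered, and sums a long but finite prefix of (3.8). The function names and the `horizon` parameter are illustrative choices, not notation from the text.

```python
from itertools import chain, islice, repeat

def episodic_return(rewards, gamma, t=0):
    """Equation (3.11): sum over k = t+1..T of gamma**(k-t-1) * R_k.

    rewards[k - 1] holds R_k, so T = len(rewards).
    """
    T = len(rewards)
    return sum(gamma ** (k - t - 1) * rewards[k - 1] for k in range(t + 1, T + 1))

def absorbing_return(rewards, gamma, horizon, t=0):
    """Equation (3.8), truncated to `horizon` terms, with zero rewards after termination."""
    padded = chain(rewards[t:], repeat(0.0))   # R_{t+1}, ..., R_T, 0, 0, 0, ...
    return sum(gamma ** k * r for k, r in enumerate(islice(padded, horizon)))

episode = [1.0, 1.0, 1.0]   # the reward sequence +1, +1, +1 from the diagram, T = 3

for gamma in (1.0, 0.9):
    print(gamma, episodic_return(episode, gamma), absorbing_return(episode, gamma, horizon=1000))
```

Because every reward after termination is zero, extending the sum past $T$ adds nothing, with or without discounting.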

3.5 Policies and Value Functions 

Almost all reinforcement learning algorithms involve estimating value functions —functions of states (or 
of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good 
it is to perform a given action in a given state). The notion of “how good” here is defined in terms 
of future rewards that can be expected, or, to be precise, in terms of expected return. Of course the 
rewards the agent can expect to receive in the future depend on what actions it will take. Accordingly, 
value functions are defined with respect to particular ways of acting, called policies. 

Formally, a policy is a mapping from states to probabilities of selecting each possible action. If the 
agent is following policy $\pi$ at time $t$, then $\pi(a \mid s)$ is the probability that $A_t = a$ if $S_t = s$. Like $p$, it
is an ordinary function; the “|” in the middle of $\pi(a \mid s)$ merely reminds us that it defines a probability
distribution over $a \in \mathcal{A}(s)$ for each $s \in \mathcal{S}$. Reinforcement learning methods specify how the agent’s
policy is changed as a result of its experience. 
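
One simple way to represent such a policy in code is a table of action probabilities per state. The Python sketch below uses the recycling robot's states and action sets, but the probabilities themselves are arbitrary placeholders; it checks that $\pi(\cdot \mid s)$ is a distribution for each state and samples $A_t$ from it.

```python
import random

# policy[s][a] = pi(a | s); the numbers are placeholders, not taken from the text.
policy = {
    "high": {"search": 0.8, "wait": 0.2},
    "low": {"search": 0.1, "wait": 0.4, "recharge": 0.5},
}

def is_valid_policy(policy, tol=1e-12):
    """Each pi(. | s) must be nonnegative and sum to 1."""
    return all(
        min(dist.values()) >= 0.0 and abs(sum(dist.values()) - 1.0) <= tol
        for dist in policy.values()
    )

def sample_action(policy, s, rng):
    """Draw A_t with probability pi(a | S_t = s)."""
    actions = list(policy[s])
    weights = [policy[s][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

rng = random.Random(0)
draws = [sample_action(policy, "high", rng) for _ in range(10000)]
print(is_valid_policy(policy), draws.count("search") / len(draws))
```

A deterministic policy is the special case in which one action in each state has probability 1.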

The value of a state $s$ under a policy $\pi$, denoted $v_\pi(s)$, is the expected return when starting in $s$ and
following $\pi$ thereafter. For MDPs, we can define $v_\pi$ formally by

$$v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right], \quad \text{for all } s \in \mathcal{S}, \tag{3.12}$$

where $\mathbb{E}_\pi[\cdot]$ denotes the expected value of a random variable given that the agent follows policy $\pi$, and
$t$ is any time step. Note that the value of the terminal state, if any, is always zero. We call the function
$v_\pi$ the state-value function for policy $\pi$.

Similarly, we define the value of taking action $a$ in state $s$ under a policy $\pi$, denoted $q_\pi(s, a)$, as the
expected return starting from $s$, taking the action $a$, and thereafter following policy $\pi$:

$$q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a \right]. \tag{3.13}$$

We call $q_\pi$ the action-value function for policy $\pi$.

The value functions $v_\pi$ and $q_\pi$ can be estimated from experience. For example, if an agent follows
policy $\pi$ and maintains an average, for each state encountered, of the actual returns that have followed
that state, then the average will converge to the state’s value, $v_\pi(s)$, as the number of times that state
is encountered approaches infinity. If separate averages are kept for each action taken in each state,
then these averages will similarly converge to the action values, $q_\pi(s, a)$. We call estimation methods
of this kind Monte Carlo methods because they involve averaging over many random samples of actual
returns. These kinds of methods are presented in Chapter 5. Of course, if there are very many states,
then it may not be practical to keep separate averages for each state individually. Instead, the agent
would have to maintain $v_\pi$ and $q_\pi$ as parameterized functions (with fewer parameters than states) and
adjust the parameters to better match the observed returns. This can also produce accurate estimates,
although much depends on the nature of the parameterized function approximator. These possibilities
are discussed in Part II of the book.
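The averaging idea can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not part of the text above: a five-state random walk that starts in the middle, a fixed policy that steps left or right with equal probability, a reward of +1 only when the walk exits on the right, and no discounting. Under these assumptions the true value of state $i$ (counting from 0) is $(i+1)/6$, and the every-visit averages should approach those numbers as the number of episodes grows.

```python
import random

def run_episode(rng, n_states=5, start=2):
    """Follow the fixed policy 'step left or right with equal probability'.

    Returns a list of (state, reward) pairs, where reward is the reward
    received on the transition out of that state. (Hypothetical task.)
    """
    state, trajectory = start, []
    while 0 <= state < n_states:
        next_state = state + rng.choice((-1, 1))
        reward = 1.0 if next_state == n_states else 0.0  # +1 only on exiting right
        trajectory.append((state, reward))
        state = next_state
    return trajectory

def mc_state_values(n_episodes=5000, gamma=1.0, n_states=5, seed=0):
    """Every-visit Monte Carlo: average the returns that followed each state."""
    rng = random.Random(seed)
    totals = [0.0] * n_states
    counts = [0] * n_states
    for _ in range(n_episodes):
        g = 0.0
        # Walk the episode backwards so g is the return following each visit.
        for state, reward in reversed(run_episode(rng, n_states)):
            g = reward + gamma * g
            totals[state] += g
            counts[state] += 1
    return [totals[s] / counts[s] if counts[s] else 0.0 for s in range(n_states)]

estimates = mc_state_values()
for s, value in enumerate(estimates):
    print(f"state {s}: estimate {value:.3f}, true value {(s + 1) / 6:.3f}")
```

Keeping one running total and one count per state is exactly the tabular bookkeeping that becomes impractical when there are very many states.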

A fundamental property of value functions used throughout reinforcement learning and dynamic
programming is that they satisfy recursive relationships similar to that which we have already established
for the return (3.9). For any policy $\pi$ and any state $s$, the following consistency condition holds between
the value of $s$ and the value of its possible successor states:


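$$\begin{aligned}
v_\pi(s) &= \mathbb{E}_\pi[G_t \mid S_t = s] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] && \text{(by (3.9))} \\
&= \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a) \Big[ r + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s'] \Big] \\
&= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \Big[ r + \gamma\, v_\pi(s') \Big], \quad \text{for all } s \in \mathcal{S}, && \text{(3.14)}
\end{aligned}$$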


where it is implicit that the actions, $a$, are taken from the set $\mathcal{A}(s)$, that the next states, $s'$, are taken
from the set $\mathcal{S}$ (or from $\mathcal{S}^+$ in the case of an episodic problem), and that the rewards, $r$, are taken from
the set $\mathcal{R}$. Note also how in the last equation we have merged the two sums, one over all the values of
$s'$ and the other over all the values of $r$, into one sum over all the possible values of both. We use this
kind of merged sum often to simplify formulas. Note how the final expression can be read easily as an
expected value. It is really a sum over all values of the three variables, $a$, $s'$, and $r$. For each triple,
we compute its probability, $\pi(a \mid s)\, p(s', r \mid s, a)$, weight the quantity in brackets by that probability, then
sum over all possibilities to get an expected value.
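Reading the right-hand side of (3.14) as a probability-weighted sum over triples translates almost directly into code. The following Python sketch is only an illustration: the function name `bellman_backup`, the dictionary layout for the policy and the dynamics, and the tiny two-state MDP used to exercise it are all hypothetical choices, not anything defined in the text.

```python
def bellman_backup(s, policy, dynamics, v, gamma):
    """Right-hand side of (3.14) for a single state s.

    policy[s][a]      -> pi(a|s)
    dynamics[(s, a)]  -> list of (next_state, reward, probability) triples,
                         i.e. the nonzero entries of p(s', r | s, a)
    v[s']             -> current value of each possible successor state
    """
    total = 0.0
    for a, pi_a in policy[s].items():
        for s_next, r, prob in dynamics[(s, a)]:
            # probability of the triple (a, s', r) times the bracketed quantity
            total += pi_a * prob * (r + gamma * v[s_next])
    return total

# Hypothetical two-state example used only to exercise the function.
policy = {"x": {"stay": 0.5, "go": 0.5}}
dynamics = {
    ("x", "stay"): [("x", 0.0, 1.0)],
    ("x", "go"): [("y", 1.0, 0.8), ("x", -1.0, 0.2)],
}
v = {"x": 2.0, "y": 10.0}

value = bellman_backup("x", policy, dynamics, v, gamma=0.5)
print(round(value, 6))  # 0.5 + 2.4 + 0.0, approximately 2.9
```

Each pass through the inner loop handles one triple $(a, s', r)$, so the two nested loops together play the role of the merged sum.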

Equation (3.14) is the Bellman equation for $v_\pi$. It expresses a relationship between the value of a state
and the values of its successor states. Think of looking ahead from a state to
its possible successor states, as suggested by the backup diagram for $v_\pi$. Each
open circle represents a state and each solid circle represents a state-action
pair. Starting from state $s$, the root node at the top, the agent could take
any of some set of actions (three are shown in the diagram) based on its
policy $\pi$. From each of these, the environment could respond with one of
several next states, $s'$ (two are shown in the figure), along with a reward, $r$,
depending on its dynamics given by the function $p$. The Bellman equation
(3.14) averages over all the possibilities, weighting each by its probability
of occurring. It states that the value of the start state must equal the
(discounted) value of the expected next state, plus the reward expected along the way.

[Figure: Backup diagram for $v_\pi$]

The value function $v_\pi$ is the unique solution to its Bellman equation. We show in subsequent chapters
how this Bellman equation forms the basis of a number of ways to compute, approximate, and learn $v_\pi$.
We call diagrams like that above backup diagrams because they diagram relationships that form the
basis of the update or backup operations that are at the heart of reinforcement learning methods. These
operations transfer value information back to a state (or a state-action pair) from its successor states
(or state-action pairs). We use backup diagrams throughout the book to provide graphical summaries
of the algorithms we discuss. (Note that, unlike transition graphs, the state nodes of backup diagrams
do not necessarily represent distinct states; for example, a state might be its own successor.)

Example 3.6: Gridworld Figure 3.3 (left) shows a rectangular gridworld representation of a simple
finite MDP. The cells of the grid correspond to the states of the environment. At each cell, four actions
are possible: north, south, east, and west, which deterministically cause the agent to move one cell
in the respective direction on the grid. Actions that would take the agent off the grid leave its location
unchanged, but also result in a reward of $-1$. Other actions result in a reward of 0, except those that
move the agent out of the special states A and B. From state A, all four actions yield a reward of +10
and take the agent to A'. From state B, all actions yield a reward of +5 and take the agent to B'.

Figure 3.3: Gridworld example: exceptional reward dynamics (left) and state-value function for the
equiprobable random policy (right).

Suppose the agent selects all four actions with equal probability in all states. Figure 3.3 (right) 
shows the value function, for this policy, for the discounted reward case with $\gamma = 0.9$. This value
function was computed by solving the system of linear equations (3.14). Notice the negative values near 
the lower edge; these are the result of the high probability of hitting the edge of the grid there under 
the random policy. State A is the best state to be in under this policy, but its expected return is less 
than 10, its immediate reward, because from A the agent is taken to A', from which it is likely to run 
into the edge of the grid. State B, on the other hand, is valued more than 5, its immediate reward, 
because from B the agent is taken to B', which has a positive value. From B' the expected penalty 
(negative reward) for possibly running into an edge is more than compensated for by the expected gain 
for possibly stumbling onto A or B. ■ 
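The value function of the equiprobable random policy can be reproduced with a short Python sketch. Two things here are assumptions rather than statements from the text: the grid is taken to be 5 by 5 with A, A', B, and B' at the positions used in the code (the figure itself is not reproduced here), and instead of solving the linear system (3.14) directly, the sketch repeatedly applies the right-hand side of (3.14) as an update until the values stop changing, which converges to the same solution for $\gamma < 1$.

```python
GAMMA = 0.9
N = 5
# Assumed layout as (row, col), with row 0 at the top of the grid.
A, A_PRIME = (0, 1), (4, 1)
B, B_PRIME = (0, 3), (2, 3)
ACTIONS = [(-1, 0), (1, 0), (0, 1), (0, -1)]  # north, south, east, west

def step(state, action):
    """Deterministic gridworld dynamics: returns (next_state, reward)."""
    if state == A:
        return A_PRIME, 10.0
    if state == B:
        return B_PRIME, 5.0
    row, col = state[0] + action[0], state[1] + action[1]
    if 0 <= row < N and 0 <= col < N:
        return (row, col), 0.0
    return state, -1.0  # off-grid move: location unchanged, reward -1

def evaluate_random_policy(tol=1e-10):
    """Iterate the Bellman equation (3.14) for pi(a|s) = 1/4 until it settles."""
    v = {(r, c): 0.0 for r in range(N) for c in range(N)}
    while True:
        new_v = {}
        for s in v:
            total = 0.0
            for a in ACTIONS:
                s_next, reward = step(s, a)
                total += 0.25 * (reward + GAMMA * v[s_next])
            new_v[s] = total
        delta = max(abs(new_v[s] - v[s]) for s in v)
        v = new_v
        if delta < tol:
            return v

v_random = evaluate_random_policy()
for r in range(N):
    print(" ".join(f"{v_random[(r, c)]:5.1f}" for c in range(N)))
```

With this layout the printed grid can be compared against the qualitative observations above: a value below 10 at A, a value above 5 at B, and negative values along the lower edge.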

Exercise 3.12 The Bellman equation (3.14) must hold for each state for the value function $v_\pi$ shown
in Figure 3.3 (right) of Example 3.6. Show numerically that this equation holds for the center state,
valued at +0.7, with respect to its four neighboring states, valued at +2.3, +0.4, $-0.4$, and +0.7. (These
numbers are accurate only to one decimal place.) □

Exercise 3.13 What is the Bellman equation for action values, that is, for $q_\pi$? It must
give the action value $q_\pi(s, a)$ in terms of the action values, $q_\pi(s', a')$, of possible
successors to the state-action pair $(s, a)$. Hint: the $q_\pi$ backup diagram corresponds
to this equation. Show the sequence of equations analogous to (3.14), but for action
values. □

[Figure: $q_\pi$ backup diagram, rooted at the state-action pair $(s, a)$]

Example 3.7: Golf To formulate playing a hole of golf as a reinforcement learning
task, we count a penalty (negative reward) of $-1$ for each stroke until we hit the ball
into the hole. The state is the location of the ball. The value of a state is the negative
of the number of strokes to the hole from that location. Our actions are how we aim and swing at the
ball, of course, and which club we select. Let us take the former as given and consider just the choice
of club, which we assume is either a putter or a driver. The upper part of Figure 3.4 shows a possible
state-value function, $v_{\text{putt}}(s)$, for the policy that always uses the putter. The terminal state in-the-hole
has a value of 0. From anywhere on the green we assume we can make a putt; these states have value
$-1$. Off the green we cannot reach the hole by putting, and the value is greater in magnitude (that is,
more negative). If we can reach the
green from a state by putting, then that state must have value one less than the green’s value, that is,
$-2$. For simplicity, let us assume we can putt very precisely and deterministically, but with a limited
range. This gives us the sharp contour line labeled $-2$ in the figure; all locations between that line and
the green require exactly two strokes to complete the hole. Similarly, any location within putting range
of the $-2$ contour line must have a value of $-3$, and so on to get all the contour lines shown in the
figure. Putting doesn’t get us out of sand traps, so they have a value of $-\infty$. Overall, it takes us six
strokes to get from the tee to the hole by putting.



Figure 3.4: A golf example: the state-value function for putting (upper) and the optimal action-value function
for using the driver, $q_*(s, \text{driver})$ (lower). ■


Exercise 3.14 In the gridworld example, rewards are positive for goals, negative for running into the
edge of the world, and zero the rest of the time. Are the signs of these rewards important, or only
the intervals between them? Prove, using (3.8), that adding a constant $c$ to all the rewards adds a
constant, $v_c$, to the values of all states, and thus does not affect the relative values of any states under
any policies. What is $v_c$ in terms of $c$ and $\gamma$? □

Exercise 3.15 Now consider adding a constant c to all the rewards in an episodic task, such as maze 
running. Would this have any effect, or would it leave the task unchanged as in the continuing task 
above? Why or why not? Give an example. □ 


Exercise 3.16 The value of a state depends on the values of the actions possible in that state and on
how likely each action is to be taken under the current policy. We can think of this in terms of a small
backup diagram rooted at the state and considering each possible action:

[Figure: backup diagram rooted at state $s$; each action branch is taken with probability $\pi(a \mid s)$]

Give the equation corresponding to this intuition and diagram for the value at the root node, $v_\pi(s)$, in
terms of the value at the expected leaf node, $q_\pi(s, a)$, given $S_t = s$. This equation should include an
expectation conditioned on following the policy, $\pi$. Then give a second equation in which the expected
value is written out explicitly in terms of $\pi(a \mid s)$ such that no expected value notation appears in the
equation. □


Exercise 3.17 The value of an action, $q_\pi(s, a)$, depends on the expected next reward and the expected
sum of the remaining rewards. Again we can think of this in terms of a small backup diagram, this one
rooted at an action (state-action pair) and branching to the possible next states:

[Figure: backup diagram rooted at the state-action pair $(s, a)$, with value $q_\pi(s, a)$, branching through the
expected rewards to the possible next states $s'$, with values $v_\pi(s')$]

Give the equation corresponding to this intuition and diagram for the action value, $q_\pi(s, a)$, in terms
of the expected next reward, $R_{t+1}$, and the expected next state value, $v_\pi(S_{t+1})$, given that $S_t = s$ and
$A_t = a$. This equation should include an expectation but not one conditioned on following
the policy. Then give a second equation, writing out the expected value explicitly in terms of $p(s', r \mid s, a)$
defined by (3.2), such that no expected value notation appears in the equation. □


3.6 Optimal Policies and Optimal Value Functions 

Solving a reinforcement learning task means, roughly, finding a policy that achieves a lot of reward over
the long run. For finite MDPs, we can precisely define an optimal policy in the following way. Value
functions define a partial ordering over policies. A policy $\pi$ is defined to be better than or equal to
a policy $\pi'$ if its expected return is greater than or equal to that of $\pi'$ for all states. In other words,
$\pi \geq \pi'$ if and only if $v_\pi(s) \geq v_{\pi'}(s)$ for all $s \in \mathcal{S}$. There is always at least one policy that is better
than or equal to all other policies. This is an optimal policy. Although there may be more than one,
we denote all the optimal policies by $\pi_*$. They share the same state-value function, called the optimal
state-value function, denoted $v_*$, and defined as

$$v_*(s) = \max_\pi v_\pi(s), \tag{3.15}$$

for all $s \in \mathcal{S}$.

Optimal policies also share the same optimal action-value function, denoted $q_*$, and defined as

$$q_*(s, a) = \max_\pi q_\pi(s, a), \tag{3.16}$$

for all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$. For the state-action pair $(s, a)$, this function gives the expected return for
taking action $a$ in state $s$ and thereafter following an optimal policy. Thus, we can write $q_*$ in terms of
$v_*$ as follows:

$$q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a]. \tag{3.17}$$

Example 3.8: Optimal Value Functions for Golf The lower part of Figure 3.4 shows the contours
of a possible optimal action-value function $q_*(s, \text{driver})$. These are the values of each state if we first
play a stroke with the driver and afterward select either the driver or the putter, whichever is better.
The driver enables us to hit the ball farther, but with less accuracy. We can reach the hole in one shot
using the driver only if we are already very close; thus the $-1$ contour for $q_*(s, \text{driver})$ covers only
a small portion of the green. If we have two strokes, however, then we can reach the hole from much
farther away, as shown by the $-2$ contour. In this case we don’t have to drive all the way to within the
small $-1$ contour, but only to anywhere on the green; from there we can use the putter. The optimal
action-value function gives the values after committing to a particular first action, in this case, to the
driver, but afterward using whichever actions are best. The $-3$ contour is still farther out and includes
the starting tee. From the tee, the best sequence of actions is two drives and one putt, sinking the ball
in three strokes. ■

Because $v_*$ is the value function for a policy, it must satisfy the self-consistency condition given by
the Bellman equation for state values (3.14). Because it is the optimal value function, however, $v_*$’s
consistency condition can be written in a special form without reference to any specific policy. This is
the Bellman equation for $v_*$, or the Bellman optimality equation. Intuitively, the Bellman optimality
equation expresses the fact that the value of a state under an optimal policy must equal the expected
return for the best action from that state:


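$$\begin{aligned}
v_*(s) &= \max_{a \in \mathcal{A}(s)} q_{\pi_*}(s, a) \\
&= \max_a \mathbb{E}_{\pi_*}[G_t \mid S_t = s, A_t = a] \\
&= \max_a \mathbb{E}_{\pi_*}[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] && \text{(by (3.9))} \\
&= \max_a \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a] && \text{(3.18)} \\
&= \max_a \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_*(s') \big]. && \text{(3.19)}
\end{aligned}$$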


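The last two equations are two forms of the Bellman optimality equation for $v_*$. The Bellman optimality
equation for $q_*$ is

$$\begin{aligned}
q_*(s, a) &= \mathbb{E}\Big[ R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\Big|\, S_t = s, A_t = a \Big] \\
&= \sum_{s', r} p(s', r \mid s, a) \Big[ r + \gamma \max_{a'} q_*(s', a') \Big]. && \text{(3.20)}
\end{aligned}$$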


The backup diagrams in Figure 3.5 show graphically the spans of future states and actions considered
in the Bellman optimality equations for $v_*$ and $q_*$. These are the same as the backup diagrams for $v_\pi$
and $q_\pi$ presented earlier except that arcs have been added at the agent’s choice points to represent that
the maximum over that choice is taken rather than the expected value given some policy. The backup
diagram on the left graphically represents the Bellman optimality equation (3.19) and the backup
diagram on the right graphically represents (3.20).

Figure 3.5: Backup diagrams for $v_*$ and $q_*$

For finite MDPs, the Bellman optimality equation for $v_*$ (3.19) has a unique solution independent
of the policy. The Bellman optimality equation is actually a system of equations, one for each state, so
if there are $n$ states, then there are $n$ equations in $n$ unknowns. If the dynamics $p$ of the environment
are known, then in principle one can solve this system of equations for $v_*$ using any one of a variety of
methods for solving systems of nonlinear equations. One can solve a related set of equations for $q_*$.

Once one has $v_*$, it is relatively easy to determine an optimal policy. For each state $s$, there will be
one or more actions at which the maximum is obtained in the Bellman optimality equation. Any policy
that assigns nonzero probability only to these actions is an optimal policy. You can think of this as a
one-step search. If you have the optimal value function, $v_*$, then the actions that appear best after a
one-step search will be optimal actions. Another way of saying this is that any policy that is greedy
with respect to the optimal evaluation function $v_*$ is an optimal policy. The term greedy is used in
computer science to describe any search or decision procedure that selects alternatives based only on
local or immediate considerations, without considering the possibility that such a selection may prevent
future access to even better alternatives. Consequently, it describes policies that select actions based
only on their short-term consequences. The beauty of $v_*$ is that if one uses it to evaluate the
short-term consequences of actions (specifically, the one-step consequences), then a greedy policy is actually
optimal in the long-term sense in which we are interested because $v_*$ already takes into account the
reward consequences of all possible future behavior. By means of $v_*$, the optimal expected long-term
return is turned into a quantity that is locally and immediately available for each state. Hence, a
one-step-ahead search yields the long-term optimal actions.

Having $q_*$ makes choosing optimal actions even easier. With $q_*$, the agent does not even have to
do a one-step-ahead search: for any state $s$, it can simply find any action that maximizes $q_*(s, a)$.
The action-value function effectively caches the results of all one-step-ahead searches. It provides the 
optimal expected long-term return as a value that is locally and immediately available for each state- 
action pair. Hence, at the cost of representing a function of state-action pairs, instead of just of states, 
the optimal action-value function allows optimal actions to be selected without having to know anything 
about possible successor states and their values, that is, without having to know anything about the 
environment’s dynamics. 
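The difference between acting greedily with $v_*$ and with $q_*$ can be made concrete with a small Python sketch. The function names, the dictionary layout, and the three-state MDP below are hypothetical and chosen only for illustration; the optimal values were worked out by hand for this toy problem with $\gamma = 0.9$. Note that the first function needs the dynamics $p(s', r \mid s, a)$ while the second does not.

```python
def greedy_from_v(s, actions, dynamics, v_star, gamma):
    """One-step-ahead search: requires the dynamics p(s', r | s, a)."""
    def lookahead(a):
        return sum(prob * (r + gamma * v_star[s_next])
                   for s_next, r, prob in dynamics[(s, a)])
    return max(actions, key=lookahead)

def greedy_from_q(s, actions, q_star):
    """No model needed: compare the cached action values directly."""
    return max(actions, key=lambda a: q_star[(s, a)])

# Hypothetical MDP: from s0, action "b" ends the episode with reward 1,
# while action "a" moves to s1 with reward 0; from s1, "a" ends with reward 2.
gamma = 0.9
dynamics = {
    ("s0", "a"): [("s1", 0.0, 1.0)],
    ("s0", "b"): [("end", 1.0, 1.0)],
    ("s1", "a"): [("end", 2.0, 1.0)],
}
v_star = {"s0": 1.8, "s1": 2.0, "end": 0.0}
q_star = {("s0", "a"): 1.8, ("s0", "b"): 1.0, ("s1", "a"): 2.0}

choice_v = greedy_from_v("s0", ["a", "b"], dynamics, v_star, gamma)
choice_q = greedy_from_q("s0", ["a", "b"], q_star)
print(choice_v, choice_q)
```

Both functions pick action "a" in s0 even though "b" has the larger immediate reward, because the optimal values already account for what can be earned afterward.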

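Example 3.9: Bellman Optimality Equations for the Recycling Robot Using (3.19), we
can explicitly give the Bellman optimality equation for the recycling robot example. To make things
more compact, we abbreviate the states high and low, and the actions search, wait, and recharge
respectively by h, l, s, w, and re. Since there are only two states, the Bellman optimality equation
consists of two equations. The equation for $v_*(\mathrm{h})$ can be written as follows:

$$\begin{aligned}
v_*(\mathrm{h}) &= \max \left\{ \begin{array}{l}
p(\mathrm{h} \mid \mathrm{h}, \mathrm{s})[r(\mathrm{h}, \mathrm{s}, \mathrm{h}) + \gamma v_*(\mathrm{h})] + p(\mathrm{l} \mid \mathrm{h}, \mathrm{s})[r(\mathrm{h}, \mathrm{s}, \mathrm{l}) + \gamma v_*(\mathrm{l})], \\
p(\mathrm{h} \mid \mathrm{h}, \mathrm{w})[r(\mathrm{h}, \mathrm{w}, \mathrm{h}) + \gamma v_*(\mathrm{h})] + p(\mathrm{l} \mid \mathrm{h}, \mathrm{w})[r(\mathrm{h}, \mathrm{w}, \mathrm{l}) + \gamma v_*(\mathrm{l})]
\end{array} \right\} \\
&= \max \left\{ \begin{array}{l}
\alpha [r_s + \gamma v_*(\mathrm{h})] + (1 - \alpha)[r_s + \gamma v_*(\mathrm{l})], \\
1 [r_w + \gamma v_*(\mathrm{h})] + 0 [r_w + \gamma v_*(\mathrm{l})]
\end{array} \right\} \\
&= \max \left\{ \begin{array}{l}
r_s + \gamma [\alpha v_*(\mathrm{h}) + (1 - \alpha) v_*(\mathrm{l})], \\
r_w + \gamma v_*(\mathrm{h})
\end{array} \right\}.
\end{aligned}$$

Following the same procedure for $v_*(\mathrm{l})$ yields the equation

$$v_*(\mathrm{l}) = \max \left\{ \begin{array}{l}
\beta r_s - 3(1 - \beta) + \gamma [(1 - \beta) v_*(\mathrm{h}) + \beta v_*(\mathrm{l})], \\
r_w + \gamma v_*(\mathrm{l}), \\
\gamma v_*(\mathrm{h})
\end{array} \right\}.$$

For any choice of $r_s$, $r_w$, $\alpha$, $\beta$, and $\gamma$, with $0 \leq \gamma < 1$, $0 \leq \alpha, \beta \leq 1$, there is exactly one pair of numbers,
$v_*(\mathrm{h})$ and $v_*(\mathrm{l})$, that simultaneously satisfy these two nonlinear equations. ■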
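One simple way to find the pair of numbers numerically is to treat the right-hand sides of the two equations as an update and apply it repeatedly. The Python sketch below does this; the parameter values ($\alpha = 0.8$, $\beta = 0.6$, $r_s = 2$, $r_w = 1$, $\gamma = 0.9$) are arbitrary assumptions picked for illustration, and the function names are hypothetical. Starting from two very different initial guesses and arriving at the same pair is consistent with the uniqueness claim, although it is of course not a proof.

```python
def backup(v_h, v_l, alpha, beta, r_search, r_wait, gamma):
    """Right-hand sides of the two Bellman optimality equations above."""
    new_h = max(
        r_search + gamma * (alpha * v_h + (1 - alpha) * v_l),  # search
        r_wait + gamma * v_h,                                   # wait
    )
    new_l = max(
        beta * r_search - 3 * (1 - beta)
        + gamma * ((1 - beta) * v_h + beta * v_l),              # search
        r_wait + gamma * v_l,                                   # wait
        gamma * v_h,                                            # recharge
    )
    return new_h, new_l

def solve(alpha, beta, r_search, r_wait, gamma, start=(0.0, 0.0), tol=1e-12):
    """Apply the backup repeatedly until both values stop changing."""
    v_h, v_l = start
    while True:
        new_h, new_l = backup(v_h, v_l, alpha, beta, r_search, r_wait, gamma)
        done = max(abs(new_h - v_h), abs(new_l - v_l)) < tol
        v_h, v_l = new_h, new_l
        if done:
            return v_h, v_l

# Assumed parameter values, chosen only for illustration.
PARAMS = dict(alpha=0.8, beta=0.6, r_search=2.0, r_wait=1.0, gamma=0.9)

solution = solve(**PARAMS)
other_start = solve(start=(100.0, -50.0), **PARAMS)
print(solution, other_start)
```

The iteration converges because, for $\gamma < 1$, each application of the backup shrinks the largest error by at least a factor of $\gamma$.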

Example 3.10: Solving the Gridworld Suppose we solve the Bellman equation for $v_*$ for the
simple grid task introduced in Example 3.6 and shown again in Figure 3.6 (left). Recall that state
A is followed by a reward of +10 and transition to state A', while state B is followed by a reward of
+5 and transition to state B'. Figure 3.6 (middle) shows the optimal value function, and Figure 3.6
(right) shows the corresponding optimal policies. Where there are multiple arrows in a cell, all of the
corresponding actions are optimal.

Figure 3.6: Optimal solutions to the gridworld example.
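A minimal Python sketch of one way to obtain such a solution follows. It makes several assumptions that are not stated in the text: a 5 by 5 grid with A, A', B, and B' at the positions given in the code, $\gamma = 0.9$ as in Example 3.6, and a particular solution method, namely repeatedly applying the right-hand side of (3.19) as an update until the values stop changing. The helper names are hypothetical.

```python
GAMMA = 0.9
N = 5
# Assumed layout as (row, col), with row 0 at the top of the grid.
A, A_PRIME = (0, 1), (4, 1)
B, B_PRIME = (0, 3), (2, 3)
ACTIONS = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}

def step(state, move):
    """Deterministic gridworld dynamics: returns (next_state, reward)."""
    if state == A:
        return A_PRIME, 10.0
    if state == B:
        return B_PRIME, 5.0
    row, col = state[0] + move[0], state[1] + move[1]
    if 0 <= row < N and 0 <= col < N:
        return (row, col), 0.0
    return state, -1.0  # off-grid move: location unchanged, reward -1

def action_values(v, s):
    """One-step lookahead value of each action in state s, given values v."""
    q = {}
    for name, move in ACTIONS.items():
        s_next, reward = step(s, move)
        q[name] = reward + GAMMA * v[s_next]
    return q

def solve_optimal_values(tol=1e-10):
    """Iterate the Bellman optimality equation (3.19) until it settles."""
    v = {(r, c): 0.0 for r in range(N) for c in range(N)}
    while True:
        new_v = {s: max(action_values(v, s).values()) for s in v}
        delta = max(abs(new_v[s] - v[s]) for s in v)
        v = new_v
        if delta < tol:
            return v

def optimal_actions(v, s, eps=1e-6):
    """All actions that attain the maximum in (3.19), up to a small tolerance."""
    q = action_values(v, s)
    best = max(q.values())
    return sorted(name for name, value in q.items() if best - value < eps)

v_star = solve_optimal_values()
for r in range(N):
    print(" ".join(f"{v_star[(r, c)]:5.1f}" for c in range(N)))
print("optimal actions at A':", optimal_actions(v_star, A_PRIME))
```

Returning every maximizing action, rather than a single one, mirrors the cells with multiple arrows in the figure.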


Explicitly solving the Bellman optimality equation provides one route to finding an optimal policy, 
and thus to solving the reinforcement learning problem. However, this solution is rarely directly useful. 
It is akin to an exhaustive search, looking ahead at all possibilities, computing their probabilities of 
occurrence and their desirabilities in terms of expected rewards. This solution relies on at least three 
assumptions that are rarely true in practice: (1) we accurately know the dynamics of the environment; 
(2) we have enough computational resources to complete the computation of the solution; and (3) 
the Markov property. For the kinds of tasks in which we are interested, one is generally not able to 
implement this solution exactly because various combinations of these assumptions are violated. For 
example, although the first and third assumptions present no problems for the game of backgammon, 
the second is a major impediment. Since the game has about $10^{20}$ states, it would take thousands of
years on today’s fastest computers to solve the Bellman equation for $v_*$, and the same is true for finding
$q_*$. In reinforcement learning one typically has to settle for approximate solutions.

Many different decision-making methods can be viewed as ways of approximately solving the Bellman 
optimality equation. For example, heuristic search methods can be viewed as expanding the right-hand 
side of (3.19) several times, up to some depth, forming a “tree” of possibilities, and then using a heuristic 
evaluation function to approximate $v_*$ at the “leaf” nodes. (Heuristic search methods such as A* are
almost always based on the episodic case.) The methods of dynamic programming can be related 
even more closely to the Bellman optimality equation. Many reinforcement learning methods can be 
clearly understood as approximately solving the Bellman optimality equation, using actual experienced 
transitions in place of knowledge of the expected transitions. We consider a variety of such methods in 
the following chapters. 

Exercise 3.18 Draw or describe the optimal state-value function for the golf example. □ 

Exercise 3.19 Draw or describe the contours of the optimal action-value function for putting,
$q_*(s, \text{putter})$, for the golf example. □

Exercise 3.20 Give the Bellman equation for $q_*$ for the recycling robot. □

Exercise 3.21 Figure 3.6 gives the optimal value of the best state of the gridworld as 24.4, to one 
decimal place. Use your knowledge of the optimal policy and (3.8) to express this value symbolically, 
and then to compute it to three decimal places. □

Exercise 3.22 Consider the continuing MDP shown on the
right. The only decision to be made is that in the top state,
where two actions are available, left and right. The numbers
show the rewards that are received deterministically after each
action. There are exactly two deterministic policies, $\pi_{\text{left}}$ and
$\pi_{\text{right}}$. What policy is optimal if $\gamma = 0$? If $\gamma = 0.9$? If $\gamma = 0.5$? □

Exercise 3.23 Give an equation for $v_*$ in terms of $q_*$. □



Exercise 3.24 Give an equation for $q_*$ in terms of $v_*$ and the world’s dynamics, $p(s', r \mid s, a)$. □

Exercise 3.25 Give an equation for $\pi_*$ in terms of $q_*$. □

Exercise 3.26 Give an equation for $\pi_*$ in terms of $v_*$ and the world’s dynamics, $p(s', r \mid s, a)$. □


3.7 Optimality and Approximation 

We have defined optimal value functions and optimal policies. Clearly, an agent that learns an optimal 
policy has done very well, but in practice this rarely happens. For the kinds of tasks in which we are 
interested, optimal policies can be generated only with extreme computational cost. A well-defined 
notion of optimality organizes the approach to learning we describe in this book and provides a way to 
understand the theoretical properties of various learning algorithms, but it is an ideal that agents can 
only approximate to varying degrees. As we discussed above, even if we have a complete and accurate 
model of the environment’s dynamics, it is usually not possible to simply compute an optimal policy by 
solving the Bellman optimality equation. For example, board games such as chess are a tiny fraction 
of human experience, yet large, custom-designed computers still cannot compute the optimal moves. 
A critical aspect of the problem facing the agent is always the computational power available to it, in 
particular, the amount of computation it can perform in a single time step. 

The memory available is also an important constraint. A large amount of memory is often required 
to build up approximations of value functions, policies, and models. In tasks with small, finite state 
sets, it is possible to form these approximations using arrays or tables with one entry for each state 
(or state-action pair). This we call the tabular case, and the corresponding methods we call tabular 
methods. In many cases of practical interest, however, there are far more states than could possibly be 
entries in a table. In these cases the functions must be approximated, using some sort of more compact 
parameterized function representation. 

Our framing of the reinforcement learning problem forces us to settle for approximations. However, 
it also presents us with some unique opportunities for achieving useful approximations. For example, 
in approximating optimal behavior, there may be many states that the agent faces with such a low 
probability that selecting suboptimal actions for them has little impact on the amount of reward the 
agent receives. Tesauro’s backgammon player, for example, plays with exceptional skill even though 
it might make very bad decisions on board configurations that never occur in games against experts. 
In fact, it is possible that TD-Gammon makes bad decisions for a large fraction of the game’s state
set. The on-line nature of reinforcement learning makes it possible to approximate optimal policies in 
ways that put more effort into learning to make good decisions for frequently encountered states, at the 
expense of less effort for infrequently encountered states. This is one key property that distinguishes 
reinforcement learning from other approaches to approximately solving MDPs.

3.8 Summary

Let us summarize the elements of the reinforcement learning problem that we have presented in this 
chapter. Reinforcement learning is about learning from interaction how to behave in order to achieve 
a goal. The reinforcement learning agent and its environment interact over a sequence of discrete time 
steps. The specification of their interface defines a particular task: the actions are the choices made by 
the agent; the states are the basis for making the choices; and the rewards are the basis for evaluating 
the choices. Everything inside the agent is completely known and controllable by the agent; everything 
outside is incompletely controllable but may or may not be completely known. A policy is a stochastic 
rule by which the agent selects actions as a function of states. The agent’s objective is to maximize the 
amount of reward it receives over time. 

When the reinforcement learning setup described above is formulated with well-defined transition
probabilities, it constitutes a Markov decision process (MDP). A finite MDP is an MDP with finite state,
action, and (as we formulate it here) reward sets. Much of the current theory of reinforcement learning 
is restricted to finite MDPs, but the methods and ideas apply more generally. 

The return is the function of future rewards that the agent seeks to maximize. It has several different
definitions depending upon the nature of the task and whether one wishes to discount delayed reward.
The undiscounted formulation is appropriate for episodic tasks, in which the agent-environment
interaction breaks naturally into episodes; the discounted formulation is appropriate for continuing tasks,
in which the interaction does not naturally break into episodes but continues without limit. We try to
define the returns for the two kinds of tasks such that one set of equations can apply to both the
episodic and continuing cases.

A policy’s value functions assign to each state, or state-action pair, the expected return from that 
state, or state-action pair, given that the agent uses the policy. The optimal value functions assign to 
each state, or state-action pair, the largest expected return achievable by any policy. A policy whose 
value functions are optimal is an optimal policy. Whereas the optimal value functions for states and 
state-action pairs are unique for a given MDP, there can be many optimal policies. Any policy that is 
greedy with respect to the optimal value functions must be an optimal policy. The Bellman optimality 
equations are special consistency conditions that the optimal value functions must satisfy and that can, 
in principle, be solved for the optimal value functions, from which an optimal policy can be determined 
with relative ease. 

A reinforcement learning problem can be posed in a variety of different ways depending on
assumptions about the level of knowledge initially available to the agent. In problems of complete knowledge,
the agent has a complete and accurate model of the environment’s dynamics. If the environment is an 
MDP, then such a model consists of the complete four-argument dynamics function p (3.2). In problems 
of incomplete knowledge, a complete and perfect model of the environment is not available. 

Even if the agent has a complete and accurate environment model, the agent is typically unable to 
perform enough computation per time step to fully use it. The memory available is also an important 
constraint. Memory may be required to build up accurate approximations of value functions, policies, 
and models. In most cases of practical interest there are far more states than could possibly be entries 
in a table, and approximations must be made. 

A well-defined notion of optimality organizes the approach to learning we describe in this book and 
provides a way to understand the theoretical properties of various learning algorithms, but it is an 
ideal that reinforcement learning agents can only approximate to varying degrees. In reinforcement 
learning we are very much concerned with cases in which optimal solutions cannot be found but must 
be approximated in some way.

Bibliographical and Historical Remarks

The reinforcement learning problem is deeply indebted to the idea of Markov decision processes (MDPs) 
from the field of optimal control. These historical influences and other major influences from psychology 
are described in the brief history given in Chapter 1. Reinforcement learning adds to MDPs a focus on 
approximation and incomplete information for realistically large problems. MDPs and the reinforcement 
learning problem are only weakly linked to traditional learning and decision-making problems in artificial 
intelligence. However, artificial intelligence is now vigorously exploring MDP formulations for planning 
and decision making from a variety of perspectives. MDPs are more general than previous formulations 
used in artificial intelligence in that they permit more general kinds of goals and uncertainty. 

The theory of MDPs is treated by, e.g., Bertsekas (2005), White (1969), Whittle (1982, 1983), and 
Puterman (1994). A particularly compact treatment of the finite case is given by Ross (1983). MDPs are 
also studied under the heading of stochastic optimal control, where adaptive optimal control methods 
are most closely related to reinforcement learning (e.g., Kumar, 1985; Kumar and Varaiya, 1986). 

The theory of MDPs evolved from efforts to understand the problem of making sequences of decisions 
under uncertainty, where each decision can depend on the previous decisions and their outcomes. It is 
sometimes called the theory of multistage decision processes, or sequential decision processes, and has 
roots in the statistical literature on sequential sampling beginning with the papers by Thompson (1933, 
1934) and Robbins (1952) that we cited in Chapter 2 in connection with bandit problems (which are 
prototypical MDPs if formulated as multiple-situation problems). 

The earliest instance of which we are aware in which reinforcement learning was discussed using 
the MDP formalism is Andreae’s (1969b) description of a unified view of learning machines. Witten 
and Corbin (1973) experimented with a reinforcement learning system later analyzed by Witten (1977) 
using the MDP formalism. Although he did not explicitly mention MDPs, Werbos (1977) suggested 
approximate solution methods for stochastic optimal control problems that are related to modern
reinforcement learning methods (see also Werbos, 1982, 1987, 1988, 1989, 1992). Although Werbos’s
ideas were not widely recognized at the time, they were prescient in emphasizing the importance of 
approximately solving optimal control problems in a variety of domains, including artificial intelligence. 
The most influential integration of reinforcement learning and MDPs is due to Watkins (1989). 

3.1 Our characterization of the dynamics of an MDP in terms of $p(s', r \mid s, a)$ is slightly unusual. It
is more common in the MDP literature to describe the dynamics in terms of the state transition
probabilities $p(s' \mid s, a)$ and expected next rewards $r(s, a)$. In reinforcement learning, however,
we more often have to refer to individual actual or sample rewards (rather than just their
expected values). Our notation also makes it plainer that $S_t$ and $R_t$ are in general jointly
determined, and thus must have the same time index. In teaching reinforcement learning, we
have found our notation to be more straightforward conceptually and easier to understand.

For a good intuitive discussion of the system-theoretic concept of state, see Minsky (1967). 

The bioreactor example is based on the work of Ungar (1990) and Miller and Williams (1992). 
The recycling robot example was inspired by the can-collecting robot built by Jonathan Connell 
(1989). 

3.2 The reward hypothesis was suggested by Michael Littman (personal communication). 

3.3—4 The terminology of episodic and continuing tasks is different from that usually used in the 
MDP literature. In that literature it is common to distinguish three types of tasks: (1) finite- 
horizon tasks, in which interaction terminates after a particular fixed number of time steps; 
(2) indefinite-horizon tasks, in which interaction can last arbitrarily long but must eventually 
terminate; and (3) infinite-horizon tasks, in which interaction does not terminate. Our episodic 
and continuing tasks are similar to indefinite-horizon and infinite-horizon tasks, respectively, 
but we prefer to emphasize the difference in the nature of the interaction. This difference
seems more fundamental than the difference in the objective functions emphasized by the usual 
terms. Often episodic tasks use an indefinite-horizon objective function and continuing tasks 
an infinite-horizon objective function, but we see this as a common coincidence rather than a 
fundamental difference. 

The pole-balancing example is from Michie and Chambers (1968) and Barto, Sutton, and 
Anderson (1983). 

3.5—6 Assigning value on the basis of what is good or bad in the long run has ancient roots. In 
control theory, mapping states to numerical values representing the long-term consequences of 
control decisions is a key part of optimal control theory, which was developed in the 1950s by 
extending nineteenth century state-function theories of classical mechanics (see, e.g., Schultz 
and Melsa, 1967). In describing how a computer could be programmed to play chess, Shannon 
(1950) suggested using an evaluation function that took into account the long-term advantages 
and disadvantages of chess positions. 

Watkins’s (1989) Q-learning algorithm for estimating $q_*$ (Chapter 6) made action-value functions
an important part of reinforcement learning, and consequently these functions are often
called “Q-functions.” But the idea of an action-value function is much older than this. Shannon
(1950) suggested that a function $h(P, M)$ could be used by a chess-playing program to
decide whether a move M in position P is worth exploring. Michie’s (1961, 1963) MENACE 
system and Michie and Chambers’s (1968) BOXES system can be understood as estimating 
action-value functions. In classical physics, Hamilton’s principal function is an action-value 
function; Newtonian dynamics are greedy with respect to this function (e.g., Goldstein, 1957). 
Action-value functions also played a central role in Denardo’s (1967) theoretical treatment of 
DP in terms of contraction mappings. 

What we call the Bellman equation for $v_*$ was popularized by Richard Bellman (1957a), who
called it the “basic functional equation.” The counterpart of the Bellman optimality equation
for continuous time and state problems is known as the Hamilton-Jacobi-Bellman equation (or
often just the Hamilton-Jacobi equation), indicating its roots in classical physics (e.g., Schultz
and Melsa, 1967).

The golf example was suggested by Chris Watkins. 
Chapter 4

Dynamic Programming


The term dynamic programming (DP) refers to a collection of algorithms that can be used to compute 
optimal policies given a perfect model of the environment as a Markov decision process (MDP). Classical
DP algorithms are of limited utility in reinforcement learning both because of their assumption of a
perfect model and because of their great computational expense, but they are still important theoretically.
DP provides an essential foundation for the understanding of the methods presented in the rest
of this book. In fact, all of these methods can be viewed as attempts to achieve much the same effect 
as DP, only with less computation and without assuming a perfect model of the environment. 

Starting with this chapter, we usually assume that the environment is a finite MDP. That is, we 
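assume that its state, action, and reward sets, $\mathcal{S}$, $\mathcal{A}$, and $\mathcal{R}$, are finite, and that its dynamics are given
by a set of probabilities $p(s', r \mid s, a)$, for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$, $r \in \mathcal{R}$, and $s' \in \mathcal{S}^+$ ($\mathcal{S}^+$ is $\mathcal{S}$ plus a terminal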
state if the problem is episodic). Although DP ideas can be applied to problems with continuous state 
and action spaces, exact solutions are possible only in special cases. A common way of obtaining 
approximate solutions for tasks with continuous states and actions is to quantize the state and action 
spaces and then apply finite-state DP methods. The methods we explore in Chapter 9 are applicable 
to continuous problems and are a significant extension of that approach. 

The key idea of DP, and of reinforcement learning generally, is the use of value functions to organize 
and structure the search for good policies. In this chapter we show how DP can be used to compute 
the value functions defined in Chapter 3. As discussed there, we can easily obtain optimal policies once 
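we have found the optimal value functions, $v_*$ or $q_*$, which satisfy the Bellman optimality equations:

\begin{align}
v_*(s) &= \max_a \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a] \notag \\
&= \max_a \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma v_*(s') \bigr] \tag{4.1}
\end{align}

or

\begin{align}
q_*(s, a) &= \mathbb{E}\bigl[ R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \bigm| S_t = s, A_t = a \bigr] \notag \\
&= \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma \max_{a'} q_*(s', a') \bigr] \tag{4.2}
\end{align}

for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$, and $s' \in \mathcal{S}^+$. As we shall see, DP algorithms are obtained by turning Bellman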
equations such as these into assignments, that is, into update rules for improving approximations of the 
desired value functions. 
4.1 Policy Evaluation (Prediction)


First we consider how to compute the state-value function $v_\pi$ for an arbitrary policy $\pi$. This is called
policy evaluation in the DP literature. We also refer to it as the prediction problem. Recall from
Chapter 3 that, for all $s \in \mathcal{S}$,

\begin{align}
v_\pi(s) &= \mathbb{E}_\pi[G_t \mid S_t = s] \notag \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \qquad \text{(from (3.9))} \notag \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] \tag{4.3} \\
&= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma v_\pi(s') \bigr], \tag{4.4}
\end{align}

where $\pi(a \mid s)$ is the probability of taking action $a$ in state $s$ under policy $\pi$, and the expectations are
subscripted by $\pi$ to indicate that they are conditional on $\pi$ being followed. The existence and uniqueness
of $v_\pi$ are guaranteed as long as either $\gamma < 1$ or eventual termination is guaranteed from all states under
the policy $\pi$.

If the environment’s dynamics are completely known, then (4.4) is a system of $|\mathcal{S}|$ simultaneous linear
equations in $|\mathcal{S}|$ unknowns (the $v_\pi(s)$, $s \in \mathcal{S}$). In principle, its solution is a straightforward, if tedious,
computation. For our purposes, iterative solution methods are most suitable. Consider a sequence
of approximate value functions $v_0, v_1, v_2, \ldots$, each mapping $\mathcal{S}^+$ to $\mathbb{R}$ (the real numbers). The initial
approximation, $v_0$, is chosen arbitrarily (except that the terminal state, if any, must be given value 0),
and each successive approximation is obtained by using the Bellman equation for $v_\pi$ (4.4) as an update
rule:

\begin{align}
v_{k+1}(s) &= \mathbb{E}_\pi[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s] \notag \\
&= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma v_k(s') \bigr], \tag{4.5}
\end{align}

for all $s \in \mathcal{S}$. Clearly, $v_k = v_\pi$ is a fixed point for this update rule because the Bellman equation for $v_\pi$
assures us of equality in this case. Indeed, the sequence $\{v_k\}$ can be shown in general to converge to
$v_\pi$ as $k \to \infty$ under the same conditions that guarantee the existence of $v_\pi$. This algorithm is called
iterative policy evaluation.

To produce each successive approximation, $v_{k+1}$ from $v_k$, iterative policy evaluation applies the same
operation to each state $s$: it replaces the old value of $s$ with a new value obtained from the old values of
the successor states of $s$, and the expected immediate rewards, along all the one-step transitions possible
under the policy being evaluated. We call this kind of operation an expected update. Each iteration of
iterative policy evaluation updates the value of every state once to produce the new approximate value
function $v_{k+1}$. There are several different kinds of expected updates, depending on whether a state (as
here) or a state-action pair is being updated, and depending on the precise way the estimated values of
the successor states are combined. All the updates done in DP algorithms are called expected updates
because they are based on an expectation over all possible next states rather than on a sample next
state. The nature of an update can be expressed in an equation, as above, or in a backup diagram
like those introduced in Chapter 3. For example, the backup diagram corresponding to the expected
update used in iterative policy evaluation is shown on page 47.

To write a sequential computer program to implement iterative policy evaluation as given by (4.5) 
you would have to use two arrays, one for the old values, $v_k(s)$, and one for the new values, $v_{k+1}(s)$.
With two arrays, the new values can be computed one by one from the old values without the old values 
being changed. Of course it is easier to use one array and update the values “in place,” that is, with 
each new value immediately overwriting the old one. Then, depending on the order in which the states 
are updated, sometimes new values are used instead of old ones on the right-hand side of (4.5). This 
in-place algorithm also converges to $v_\pi$; in fact, it usually converges faster than the two-array version,
as you might expect, since it uses new data as soon as they are available. We think of the updates as 
being done in a sweep through the state space. For the in-place algorithm, the order in which states 
have their values updated during the sweep has a significant influence on the rate of convergence. We 
usually have the in-place version in mind when we think of DP algorithms. 

A complete in-place version of iterative policy evaluation is shown in the box below. Note how it 
handles termination. Formally, iterative policy evaluation converges only in the limit, but in practice 
it must be halted short of this. The boxed algorithm tests the quantity $\max_{s \in \mathcal{S}} |v_{k+1}(s) - v_k(s)|$ after
each sweep and stops when it is sufficiently small. 
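As an illustration of the in-place algorithm and its stopping test, here is a minimal Python sketch. It is not the book's boxed algorithm verbatim: the dictionary layout `P[s][a] -> [(prob, next_state, reward), ...]`, the function name `policy_evaluation`, and the two-state chain used to exercise it are all assumptions made for this example. Any state that is not listed in `states` (here the terminal state `"T"`) is treated as having value 0.

```python
def policy_evaluation(P, policy, states, gamma=1.0, theta=1e-8):
    """In-place iterative policy evaluation (update (4.5) with one array).

    P[s][a]      -> list of (prob, next_state, reward) triples (assumed layout)
    policy[s][a] -> probability of taking action a in state s
    states       -> nonterminal states; every other state has value 0
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_old = V[s]
            # Expected update: average over actions, next states and rewards.
            V[s] = sum(
                pi_a * prob * (r + gamma * V.get(s2, 0.0))
                for a, pi_a in policy[s].items()
                for prob, s2, r in P[s][a]
            )
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:  # largest change in this sweep is small enough
            return V


# Hypothetical chain: state 0 -> state 1 -> terminal "T", reward -1 per step.
P = {0: {"go": [(1.0, 1, -1.0)]}, 1: {"go": [(1.0, "T", -1.0)]}}
policy = {0: {"go": 1.0}, 1: {"go": 1.0}}
V = policy_evaluation(P, policy, states=[0, 1])
print(V)
```

Because the sweep visits state 0 before state 1, the first sweep uses a stale value for state 1; sweeping in the opposite order would reach the fixed point one sweep sooner, which is the order dependence mentioned above.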



Example 4.1 Consider the 4x4 gridworld shown below. 



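[Figure: 4×4 gridworld. The two shaded corner cells (marked T) are the terminal state; the other cells are numbered 1 to 14. Actions: up, down, right, left. $R_t = -1$ on all transitions.]

     T   1   2   3
     4   5   6   7
     8   9  10  11
    12  13  14   T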


The nonterminal states are $\mathcal{S} = \{1, 2, \ldots, 14\}$. There are four actions possible in each state, $\mathcal{A} =
\{\text{up}, \text{down}, \text{right}, \text{left}\}$, which deterministically cause the corresponding state transitions, except that
actions that would take the agent off the grid in fact leave the state unchanged. Thus, for instance,
$p(6, -1 \mid 5, \text{right}) = 1$, $p(7, -1 \mid 7, \text{right}) = 1$, and $p(10, r \mid 5, \text{right}) = 0$ for all $r \in \mathcal{R}$. This is an
undiscounted, episodic task. The reward is $-1$ on all transitions until the terminal state is reached.
The terminal state is shaded in the figure (although it is shown in two places, it is formally one state).
The expected reward function is thus $r(s, a, s') = -1$ for all states $s$, $s'$ and actions $a$. Suppose the
agent follows the equiprobable random policy (all actions equally likely). The left side of Figure 4.1
shows the sequence of value functions $\{v_k\}$ computed by iterative policy evaluation. The final estimate
is in fact $v_\pi$, which in this case gives for each state the negation of the expected number of steps from
that state until termination. ■
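The gridworld of Example 4.1 is small enough to evaluate directly. The following Python sketch uses the two-array form of update (4.5) under the equiprobable random policy. The cell indexing (cells 0 and 15 stand for the single terminal state, cells 1 to 14 match the state numbers) and the names `step`, `sweep`, and `history` are assumptions made for this example, and the fixed count of 1000 sweeps replaces a proper stopping test for brevity.

```python
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
TERMINAL = {0, 15}  # both shaded corners; formally one terminal state


def step(s, a):
    """Deterministic move on the 4x4 grid; off-grid moves leave s unchanged."""
    row, col = divmod(s, 4)
    d_row, d_col = ACTIONS[a]
    new_row, new_col = row + d_row, col + d_col
    if not (0 <= new_row < 4 and 0 <= new_col < 4):
        return s
    return 4 * new_row + new_col


def sweep(v):
    """One two-array sweep of (4.5): reward -1, gamma = 1, random policy."""
    new_v = list(v)
    for s in range(16):
        if s not in TERMINAL:
            new_v[s] = sum(0.25 * (-1.0 + v[step(s, a)]) for a in ACTIONS)
    return new_v


v = [0.0] * 16
history = [v]  # history[k] holds the approximation v_k
for _ in range(1000):
    v = sweep(v)
    history.append(v)

for row in range(4):
    print(["%6.1f" % v[4 * row + col] for col in range(4)])
```

Here `history[1]` and `history[2]` correspond to the second and third value functions in the sequence $\{v_k\}$, and the final `v` approximates $v_\pi$.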

Exercise 4.1 In Example 4.1, if $\pi$ is the equiprobable random policy, what is $q_\pi(11, \text{down})$? What is
$q_\pi(7, \text{down})$? □

Figure 4.1: Convergence of iterative policy evaluation on a small gridworld. The left column is the sequence
of approximations of the state-value function for the random policy (all actions equally likely). The right column is
the sequence of greedy policies corresponding to the value function estimates (arrows are shown for all actions
achieving the maximum). The last policy is guaranteed only to be an improvement over the random policy, but
in this case it, and all policies after the third iteration, are optimal.

Exercise 4.2 In Example 4.1, suppose a new state 15 is added to the gridworld just below state 13,
and its actions, left, up, right, and down, take the agent to states 12, 13, 14, and 15, respectively.
Assume that the transitions from the original states are unchanged. What, then, is $v_\pi(15)$ for the
equiprobable random policy? Now suppose the dynamics of state 13 are also changed, such that action
down from state 13 takes the agent to the new state 15. What is $v_\pi(15)$ for the equiprobable random
policy in this case? □

Exercise 4.3 What are the equations analogous to (4.3), (4.4), and (4.5) for the action-value function 
$q_\pi$ and its successive approximation by a sequence of functions $q_0, q_1, q_2, \ldots$? □


4.2 Policy Improvement 

Our reason for computing the value function for a policy is to help find better policies. Suppose we 
have determined the value function $v_\pi$ for an arbitrary deterministic policy $\pi$. For some state $s$ we
would like to know whether or not we should change the policy to deterministically choose an action
$a \neq \pi(s)$. We know how good it is to follow the current policy from $s$ (that is, $v_\pi(s)$), but would it be
better or worse to change to the new policy? One way to answer this question is to consider selecting
$a$ in $s$ and thereafter following the existing policy, $\pi$. The value of this way of behaving is

\begin{align}
q_\pi(s, a) &= \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a] \tag{4.6} \\
&= \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma v_\pi(s') \bigr]. \notag
\end{align}

The key criterion is whether this is greater than or less than $v_\pi(s)$. If it is greater, that is, if it is better
to select $a$ once in $s$ and thereafter follow $\pi$ than it would be to follow $\pi$ all the time, then one would
expect it to be better still to select $a$ every time $s$ is encountered, and that the new policy would in
fact be a better one overall.

That this is true is a special case of a general result called the policy improvement theorem. Let $\pi$
and $\pi'$ be any pair of deterministic policies such that, for all $s \in \mathcal{S}$,

\[ q_\pi(s, \pi'(s)) \geq v_\pi(s). \tag{4.7} \]

Then the policy $\pi'$ must be as good as, or better than, $\pi$. That is, it must obtain greater or equal
expected return from all states $s \in \mathcal{S}$:

\[ v_{\pi'}(s) \geq v_\pi(s). \tag{4.8} \]

Moreover, if there is strict inequality of (4.7) at any state, then there must be strict inequality of (4.8)
for at least one state. This result applies in particular to the two policies that we considered in the previous
paragraph, an original deterministic policy, $\pi$, and a changed policy, $\pi'$, that is identical to $\pi$ except
that $\pi'(s) = a \neq \pi(s)$. Obviously, (4.7) holds at all states other than $s$. Thus, if $q_\pi(s, a) > v_\pi(s)$, then
the changed policy is indeed better than $\pi$.

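The idea behind the proof of the policy improvement theorem is easy to understand. Starting from
(4.7), we keep expanding the $q_\pi$ side with (4.6) and reapplying (4.7) until we get $v_{\pi'}(s)$:

\begin{align*}
v_\pi(s) &\leq q_\pi(s, \pi'(s)) \\
&= \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = \pi'(s)] \qquad \text{(by (4.6))} \\
&= \mathbb{E}_{\pi'}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] \\
&\leq \mathbb{E}_{\pi'}[R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s] \\
&= \mathbb{E}_{\pi'}\bigl[ R_{t+1} + \gamma \mathbb{E}_{\pi'}[R_{t+2} + \gamma v_\pi(S_{t+2})] \bigm| S_t = s \bigr] \\
&= \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) \mid S_t = s] \\
&\leq \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 v_\pi(S_{t+3}) \mid S_t = s] \\
&\;\;\vdots \\
&\leq \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots \mid S_t = s] \\
&= v_{\pi'}(s).
\end{align*}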

So far we have seen how, given a policy and its value function, we can easily evaluate a change in the
policy at a single state to a particular action. It is a natural extension to consider changes at all states
and to all possible actions, selecting at each state the action that appears best according to $q_\pi(s, a)$.
In other words, to consider the new greedy policy, $\pi'$, given by

\begin{align}
\pi'(s) &= \operatorname*{argmax}_a q_\pi(s, a) \notag \\
&= \operatorname*{argmax}_a \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a] \tag{4.9} \\
&= \operatorname*{argmax}_a \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma v_\pi(s') \bigr], \notag
\end{align}
where $\operatorname{argmax}_a$ denotes the value of $a$ at which the expression that follows is maximized (with ties
broken arbitrarily). The greedy policy takes the action that looks best in the short term (after one
step of lookahead) according to $v_\pi$. By construction, the greedy policy meets the conditions of the
policy improvement theorem (4.7), so we know that it is as good as, or better than, the original policy.
The process of making a new policy that improves on an original policy, by making it greedy with
respect to the value function of the original policy, is called policy improvement.
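The one-step lookahead of (4.6) and the greedy choice of (4.9) translate almost line for line into code. The sketch below is an illustration under assumptions, not part of the original text: the layout `P[s][a] -> [(prob, next_state, reward), ...]`, the names `q_from_v` and `greedy_policy`, and the one-state example are all hypothetical. Ties are broken by taking the first maximizing action, which is one arbitrary choice among those the text allows.

```python
def q_from_v(P, V, s, a, gamma=1.0):
    """Equation (4.6): expected reward plus discounted value of the successor."""
    return sum(prob * (r + gamma * V.get(s2, 0.0)) for prob, s2, r in P[s][a])


def greedy_policy(P, V, gamma=1.0):
    """Equation (4.9): in each state pick an action that maximizes q_pi(s, a)."""
    return {s: max(P[s], key=lambda a: q_from_v(P, V, s, a, gamma)) for s in P}


# Hypothetical one-state MDP: "stay" loops with reward 0,
# "exit" moves to the terminal state "T" with reward +1.
P = {"A": {"stay": [(1.0, "A", 0.0)], "exit": [(1.0, "T", 1.0)]}}
V_stay = {"A": 0.0}  # value function of the policy that always stays
improved = greedy_policy(P, V_stay, gamma=0.9)
print(improved)
```

In this example $q_\pi(A, \text{exit}) = 1 > v_\pi(A) = 0$, so condition (4.7) holds with strict inequality and the greedy policy is strictly better than the original one.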

Suppose the new greedy policy, $\pi'$, is as good as, but not better than, the old policy $\pi$. Then $v_\pi = v_{\pi'}$,
and from (4.9) it follows that for all $s \in \mathcal{S}$:

\begin{align*}
v_{\pi'}(s) &= \max_a \mathbb{E}[R_{t+1} + \gamma v_{\pi'}(S_{t+1}) \mid S_t = s, A_t = a] \\
&= \max_a \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma v_{\pi'}(s') \bigr].
\end{align*}

But this is the same as the Bellman optimality equation (4.1), and therefore, $v_{\pi'}$ must be $v_*$, and both
$\pi$ and $\pi'$ must be optimal policies. Policy improvement thus must give us a strictly better policy except
when the original policy is already optimal.

So far in this section we have considered the special case of deterministic policies. In the general case, 
a stochastic policy $\pi$ specifies probabilities, $\pi(a \mid s)$, for taking each action, $a$, in each state, $s$. We will
not go through the details, but in fact all the ideas of this section extend easily to stochastic policies. 
In particular, the policy improvement theorem carries through as stated for the stochastic case. In 
addition, if there are ties in policy improvement steps such as (4.9)—that is, if there are several actions 
at which the maximum is achieved—then in the stochastic case we need not select a single action from 
among them. Instead, each maximizing action can be given a portion of the probability of being selected 
in the new greedy policy. Any apportioning scheme is allowed as long as all submaximal actions are 
given zero probability. 

The last row of Figure 4.1 shows an example of policy improvement for stochastic policies. Here the 
original policy, $\pi$, is the equiprobable random policy, and the new policy, $\pi'$, is greedy with respect
to $v_\pi$. The value function $v_\pi$ is shown in the bottom-left diagram and the set of possible $\pi'$ is shown
in the bottom-right diagram. The states with multiple arrows in the $\pi'$ diagram are those in which
several actions achieve the maximum in (4.9); any apportionment of probability among these actions is
permitted. The value function of any such policy, $v_{\pi'}(s)$, can be seen by inspection to be either $-1$, $-2$,
or $-3$ at all states, $s \in \mathcal{S}$, whereas $v_\pi(s)$ is at most $-14$. Thus, $v_{\pi'}(s) \geq v_\pi(s)$, for all $s \in \mathcal{S}$, illustrating
policy improvement. Although in this case the new policy $\pi'$ happens to be optimal, in general only an
improvement is guaranteed. 


4.3 Policy Iteration 

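Once a policy, $\pi$, has been improved using $v_\pi$ to yield a better policy, $\pi'$, we can then compute $v_{\pi'}$ and
improve it again to yield an even better $\pi''$. We can thus obtain a sequence of monotonically improving
policies and value functions:

\[ \pi_0 \xrightarrow{\;E\;} v_{\pi_0} \xrightarrow{\;I\;} \pi_1 \xrightarrow{\;E\;} v_{\pi_1} \xrightarrow{\;I\;} \pi_2 \xrightarrow{\;E\;} \cdots \xrightarrow{\;I\;} \pi_* \xrightarrow{\;E\;} v_*, \]

where $\xrightarrow{\;E\;}$ denotes a policy evaluation and $\xrightarrow{\;I\;}$ denotes a policy improvement. Each policy is guaranteed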
to be a strict improvement over the previous one (unless it is already optimal). Because a finite MDP 
has only a finite number of policies, this process must converge to an optimal policy and optimal value 
function in a finite number of iterations. 

This way of finding an optimal policy is called policy iteration. A complete algorithm is given in
the box below. Note that each policy evaluation, itself an iterative computation, is started
with the value function for the previous policy. This typically results in a great increase in the speed of
convergence of policy evaluation (presumably because the value function changes little from one policy
to the next).

Policy iteration (using iterative policy evaluation)

1. Initialization
   V(s) ∈ ℝ and π(s) ∈ A(s) arbitrarily for all s ∈ S

2. Policy Evaluation
   Repeat
      Δ ← 0
      For each s ∈ S:
         v ← V(s)
         V(s) ← Σ_{s',r} p(s', r | s, π(s)) [r + γ V(s')]
         Δ ← max(Δ, |v − V(s)|)
   until Δ < θ (a small positive number)

3. Policy Improvement
   policy-stable ← true
   For each s ∈ S:
      old-action ← π(s)
      π(s) ← argmax_a Σ_{s',r} p(s', r | s, a) [r + γ V(s')]
      If old-action ≠ π(s), then policy-stable ← false
   If policy-stable, then stop and return V ≈ v_* and π ≈ π_*; else go to 2
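For readers who prefer running code, the following Python sketch follows the three steps of policy iteration. Several things here are assumptions rather than part of the original algorithm: the layout `P[s][a] -> [(prob, next_state, reward), ...]`, the name `policy_iteration`, the two-state example MDP, and, most notably, the improvement step, which changes an action only when the new one is better by more than a small tolerance. That last choice is one possible way to avoid endless switching between equally good actions; it is a design decision of this sketch, not the only remedy.

```python
def policy_iteration(P, gamma=0.9, theta=1e-10, tol=1e-12):
    states = list(P)
    # 1. Initialization: zero values, first listed action in every state.
    V = {s: 0.0 for s in states}
    policy = {s: next(iter(P[s])) for s in states}

    def q(s, a):
        """One-step lookahead using the current value estimates."""
        return sum(prob * (r + gamma * V.get(s2, 0.0)) for prob, s2, r in P[s][a])

    while True:
        # 2. Policy evaluation (in place, warm-started from the previous V).
        while True:
            delta = 0.0
            for s in states:
                v_old = V[s]
                V[s] = q(s, policy[s])
                delta = max(delta, abs(v_old - V[s]))
            if delta < theta:
                break
        # 3. Policy improvement: switch only on a clear improvement.
        policy_stable = True
        for s in states:
            best = max(P[s], key=lambda a: q(s, a))
            if q(s, best) > q(s, policy[s]) + tol:
                policy[s] = best
                policy_stable = False
        if policy_stable:
            return policy, V


# Hypothetical continuing MDP with two states and two actions each.
MDP = {
    "low": {"wait": [(1.0, "low", 0.0)], "work": [(1.0, "high", -1.0)]},
    "high": {"wait": [(1.0, "high", 2.0)], "work": [(1.0, "low", 3.0)]},
}
pi_star, v_star = policy_iteration(MDP)
print(pi_star, v_star)
```

For this example the values can be checked by hand: waiting in `high` forever is worth $2 / (1 - 0.9) = 20$, and working in `low` is worth $-1 + 0.9 \cdot 20 = 17$.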

Policy iteration often converges in surprisingly few iterations. This is illustrated by the example 
in Figure 4.1. The bottom-left diagram shows the value function for the equiprobable random policy, 
and the bottom-right diagram shows a greedy policy for this value function. The policy improvement 
theorem assures us that these policies are better than the original random policy. In this case, however, 
these policies are not just better, but optimal, proceeding to the terminal states in the minimum number 
of steps. In this example, policy iteration would find the optimal policy after just one iteration. 

Exercise 4.4 The policy iteration algorithm given on this page has a subtle bug in that it may never 
terminate if the policy continually switches between two or more policies that are equally good. This is 
OK for pedagogy, but not for actual use. Modify the pseudocode so that convergence is guaranteed. □

Example 4.2: Jack’s Car Rental Jack manages two locations for a nationwide car rental company. 
Each day, some number of customers arrive at each location to rent cars. If Jack has a car available, he 
rents it out and is credited $10 by the national company. If he is out of cars at that location, then the 
business is lost. Cars become available for renting the day after they are returned. To help ensure that 
cars are available where they are needed, Jack can move them between the two locations overnight, at 
a cost of $2 per car moved. We assume that the number of cars requested and returned at each location 
are Poisson random variables, meaning that the probability that the number is $n$ is $\frac{\lambda^n}{n!} e^{-\lambda}$, where $\lambda$ is
the expected number. Suppose $\lambda$ is 3 and 4 for rental requests at the first and second locations and
3 and 2 for returns. To simplify the problem slightly, we assume that there can be no more than 20
cars at each location (any additional cars are returned to the nationwide company, and thus disappear
from the problem) and a maximum of five cars can be moved from one location to the other in one
night. We take the discount rate to be $\gamma = 0.9$ and formulate this as a continuing finite MDP, where
the time steps are days, the state is the number of cars at each location at the end of the day, and 
the actions are the net numbers of cars moved between the two locations overnight. Figure 4.2 shows
the sequence of policies found by policy iteration starting from the policy that never moves any cars.

Figure 4.2: The sequence of policies found by policy iteration on Jack’s car rental problem, and the final
state-value function. The first five diagrams show, for each number of cars at each location at the end of the
day, the number of cars to be moved from the first location to the second (negative numbers indicate transfers
from the second location to the first). Each successive policy is a strict improvement over the previous policy,
and the last policy is optimal.
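A building block of any model of this problem is the Poisson probability of $n$ requests or returns. The short Python sketch below computes it and uses it to estimate the expected number of cars rented at one location in a day. The helper names `poisson` and `expected_rentals` and the truncation of the sum at 40 requests are assumptions of this example; it is not a solution of the full car rental MDP.

```python
from math import exp, factorial


def poisson(n, lam):
    """Probability that a Poisson variable with mean lam equals n."""
    return lam ** n / factorial(n) * exp(-lam)


def expected_rentals(cars, lam, max_n=40):
    """Expected number of cars rented when `cars` are available.

    Requests beyond the cars on hand are lost business, hence min(n, cars).
    The sum is truncated at max_n, where the Poisson tail is negligible
    for the small means used in the example.
    """
    return sum(poisson(n, lam) * min(n, cars) for n in range(max_n + 1))


# First location: requests have mean 3; each rental is credited $10.
for cars in (0, 1, 5, 20):
    print(cars, 10 * expected_rentals(cars, 3))
```

With plenty of cars on hand the expected number of rentals approaches the mean request rate, while with a single car it equals the probability of at least one request, $1 - e^{-3}$.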


Exercise 4.5 (programming) Write a program for policy iteration and re-solve Jack’s car rental 
problem with the following changes. One of Jack’s employees at the first location rides a bus home 
each night and lives near the second location. She is happy to shuttle one car to the second location 
for free. Each additional car still costs $2, as do all cars moved in the other direction. In addition, 
Jack has limited parking space at each location. If more than 10 cars are kept overnight at a location 
(after any moving of cars), then an additional cost of $4 must be incurred to use a second parking lot 
(independent of how many cars are kept there). These sorts of nonlinearities and arbitrary dynamics 
often occur in real problems and cannot easily be handled by optimization methods other than dynamic 
programming. To check your program, first replicate the results given for the original problem. If your 
computer is too slow for the full problem, cut all the numbers of cars in half. □ 

Exercise 4.6 How would policy iteration be defined for action values? Give a complete algorithm 
for computing $q_*$, analogous to that on page 65 for computing $v_*$. Please pay special attention to this
exercise, because the ideas involved will be used throughout the rest of the book. □ 

Exercise 4.7 Suppose you are restricted to considering only policies that are $\epsilon$-soft, meaning that
the probability of selecting each action in each state, $s$, is at least $\epsilon / |\mathcal{A}(s)|$. Describe qualitatively the
changes that would be required in each of the steps 3, 2, and 1, in that order, of the policy iteration
algorithm for $v_*$ (page 65). □
4.4 Value Iteration

One drawback to policy iteration is that each of its iterations involves policy evaluation, which may 
itself be a protracted iterative computation requiring multiple sweeps through the state set. If policy 
evaluation is done iteratively, then convergence exactly to $v_\pi$ occurs only in the limit. Must we wait
for exact convergence, or can we stop short of that? The example in Figure 4.1 certainly suggests that 
it may be possible to truncate policy evaluation. In that example, policy evaluation iterations beyond 
the first three have no effect on the corresponding greedy policy. 

In fact, the policy evaluation step of policy iteration can be truncated in several ways without losing 
the convergence guarantees of policy iteration. One important special case is when policy evaluation 
is stopped after just one sweep (one update of each state). This algorithm is called value iteration. It 
can be written as a particularly simple update operation that combines the policy improvement and 
truncated policy evaluation steps: 

\begin{align}
v_{k+1}(s) &= \max_a \mathbb{E}[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s, A_t = a] \notag \\
&= \max_a \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma v_k(s') \bigr], \tag{4.10}
\end{align}

for all $s \in \mathcal{S}$. For arbitrary $v_0$, the sequence $\{v_k\}$ can be shown to converge to $v_*$ under the same
conditions that guarantee the existence of $v_*$.

Another way of understanding value iteration is by reference to the Bellman optimality equation 
(4.1). Note that value iteration is obtained simply by turning the Bellman optimality equation into 
an update rule. Also note how the value iteration update is identical to the policy evaluation update 
(4.5) except that it requires the maximum to be taken over all actions. Another way of seeing this close 
relationship is to compare the backup diagrams for these algorithms on page 47 (policy evaluation) and 
on the left of Figure 3.5 (value iteration). These two are the natural backup operations for computing 
$v_\pi$ and $v_*$.

Finally, let us consider how value iteration terminates. Like policy evaluation, value iteration formally 
requires an infinite number of iterations to converge exactly to v*. In practice, we stop once the value 
function changes by only a small amount in a sweep. The box shows a complete algorithm with this 
kind of termination condition. 


Value iteration

Initialize array V arbitrarily (e.g., V(s) = 0 for all s ∈ S⁺)

Repeat
   Δ ← 0
   For each s ∈ S:
      v ← V(s)
      V(s) ← max_a Σ_{s',r} p(s', r | s, a) [r + γ V(s')]
      Δ ← max(Δ, |v − V(s)|)
until Δ < θ (a small positive number)

Output a deterministic policy, π ≈ π_*, such that
   π(s) = argmax_a Σ_{s',r} p(s', r | s, a) [r + γ V(s')]
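A direct Python rendering of the value iteration procedure might look as follows. This is a sketch under assumptions: the layout `P[s][a] -> [(prob, next_state, reward), ...]`, the name `value_iteration`, and the two-state example MDP are hypothetical and chosen only so that the result can be checked by hand.

```python
def value_iteration(P, gamma=0.9, theta=1e-10):
    V = {s: 0.0 for s in P}  # arbitrary initialization; zeros are convenient

    def q(s, a):
        """One-step lookahead using the current value estimates."""
        return sum(prob * (r + gamma * V.get(s2, 0.0)) for prob, s2, r in P[s][a])

    while True:
        delta = 0.0
        for s in P:
            v_old = V[s]
            V[s] = max(q(s, a) for a in P[s])  # update (4.10), in place
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            break
    # Read off a deterministic policy that is greedy with respect to V.
    policy = {s: max(P[s], key=lambda a: q(s, a)) for s in P}
    return policy, V


# Hypothetical continuing MDP with two states and two actions each.
MDP = {
    "low": {"wait": [(1.0, "low", 0.0)], "work": [(1.0, "high", -1.0)]},
    "high": {"wait": [(1.0, "high", 2.0)], "work": [(1.0, "low", 3.0)]},
}
policy, V = value_iteration(MDP)
print(policy, V)
```

The only difference from a policy evaluation sweep is the `max` over actions in the update line; no explicit policy is stored until the final greedy read-out.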

Value iteration effectively combines, in each of its sweeps, one sweep of policy evaluation and one 
sweep of policy improvement. Faster convergence is often achieved by interposing multiple policy 
evaluation sweeps between each policy improvement sweep. In general, the entire class of truncated 
policy iteration algorithms can be thought of as sequences of sweeps, some of which use policy evaluation
updates and some of which use value iteration updates. Since the max operation in (4.10) is the only 
difference between these updates, this just means that the max operation is added to some sweeps of 
policy evaluation. All of these algorithms converge to an optimal policy for discounted finite MDPs. 

Example 4.3: Gambler’s Problem A gambler has the opportunity to make bets on the outcomes 
of a sequence of coin flips. If the coin comes up heads, he wins as many dollars as he has staked on 
that flip; if it is tails, he loses his stake. The game ends when the gambler wins by reaching his goal 
of $100, or loses by running out of money. On each flip, the gambler must decide what portion of his 
capital to stake, in integer numbers of dollars. This problem can be formulated as an undiscounted, 
episodic, finite MDP. The state is the gambler’s capital, $s \in \{1, 2, \ldots, 99\}$, and the actions are stakes,
$a \in \{0, 1, \ldots, \min(s, 100 - s)\}$. The reward is zero on all transitions except those on which the gambler
reaches his goal, when it is $+1$. The state-value function then gives the probability of winning from
each state. A policy is a mapping from levels of capital to stakes. The optimal policy maximizes
the probability of reaching the goal. Let $p_h$ denote the probability of the coin coming up heads. If
$p_h$ is known, then the entire problem is known and it can be solved, for instance, by value iteration.
Figure 4.3 shows the change in the value function over successive sweeps of value iteration, and the final
policy found, for the case of $p_h = 0.4$. This policy is optimal, but not unique. In fact, there is a whole
family of optimal policies, all corresponding to ties for the argmax action selection with respect to the 
optimal value function. Can you guess what the entire family looks like? 


[Figure 4.3 (image not reproduced). Upper panel: value estimates versus capital. Lower panel: final policy (stake) versus capital. Horizontal axis: capital, from 1 to 99.]

Figure 4.3: The solution to the gambler’s problem for $p_h = 0.4$. The upper graph shows the value function
found by successive sweeps of value iteration. The lower graph shows the final policy. ■
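
The computation behind Example 4.3 can be sketched in a few lines. The following is a minimal illustration, not the code used to produce the figure: the function name, the threshold `theta`, and the tie-breaking rule are assumptions made here, and the two dummy terminal states (capital 0 and 100, with values 0 and 1) stand in for the +1 reward on reaching the goal.

```python
import numpy as np

def gambler_value_iteration(p_h=0.4, goal=100, theta=1e-12):
    """In-place value iteration for the gambler's problem (undiscounted)."""
    # V[0] and V[goal] are dummy terminal states fixed at 0 and 1, so the
    # +1 reward for reaching the goal is carried by V[goal] itself.
    V = np.zeros(goal + 1)
    V[goal] = 1.0
    while True:
        delta = 0.0
        for s in range(1, goal):
            # A stake of 0 leaves the capital unchanged and so never helps
            # reach the goal; it is omitted here.
            stakes = np.arange(1, min(s, goal - s) + 1)
            q = p_h * V[s + stakes] + (1.0 - p_h) * V[s - stakes]
            best = q.max()
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Greedy policy. Rounding before argmax makes the choice among tied
    # stakes (smallest stake wins) less sensitive to floating-point noise.
    policy = np.zeros(goal + 1, dtype=int)
    for s in range(1, goal):
        stakes = np.arange(1, min(s, goal - s) + 1)
        q = p_h * V[s + stakes] + (1.0 - p_h) * V[s - stakes]
        policy[s] = stakes[np.argmax(np.round(q, 9))]
    return V, policy

V, policy = gambler_value_iteration(p_h=0.4)
```

Because many stakes tie for the maximum, a different tie-breaking rule yields a different but equally optimal policy.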


Exercise 4.8 Why does the optimal policy for the gambler’s problem have such a curious form? In 
particular, for capital of 50 it bets it all on one flip, but for capital of 51 it does not. Why is this a 
good policy? □

Exercise 4.9 (programming) Implement value iteration for the gambler’s problem and solve it for
$p_h = 0.25$ and $p_h = 0.55$. In programming, you may find it convenient to introduce two dummy states
corresponding to termination with capital of 0 and 100, giving them values of 0 and 1 respectively.
Show your results graphically, as in Figure 4.3. Are your results stable as $\theta \to 0$? □

Exercise 4.10 What is the analog of the value iteration update (4.10) for action values, $q_{k+1}(s, a)$? □


4.5 Asynchronous Dynamic Programming 

A major drawback to the DP methods that we have discussed so far is that they involve operations 
over the entire state set of the MDP, that is, they require sweeps of the state set. If the state set is very 
large, then even a single sweep can be prohibitively expensive. For example, the game of backgammon 
has over $10^{20}$ states. Even if we could perform the value iteration update on a million states per second,
it would take over a thousand years to complete a single sweep. 
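
The arithmetic behind this estimate is short enough to check directly. The snippet below is only a back-of-the-envelope calculation using the two figures quoted above; the variable names are chosen here for illustration.

```python
states = 10**20                 # lower bound on the number of backgammon states
updates_per_second = 10**6      # one million state updates per second
seconds_per_year = 365.25 * 24 * 3600

years_per_sweep = states / updates_per_second / seconds_per_year
print(f"{years_per_sweep:.2e} years per sweep")  # prints 3.17e+06 years per sweep
```

A single sweep would therefore take on the order of millions of years, comfortably more than the thousand years stated.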

Asynchronous DP algorithms are in-place iterative DP algorithms that are not organized in terms 
of systematic sweeps of the state set. These algorithms update the values of states in any order 
whatsoever, using whatever values of other states happen to be available. The values of some states 
may be updated several times before the values of others are updated once. To converge correctly, 
however, an asynchronous algorithm must continue to update the values of all the states: it can’t ignore 
any state after some point in the computation. Asynchronous DP algorithms allow great flexibility in 
selecting states to update. 

For example, one version of asynchronous value iteration updates the value, in place, of only one 
state, $s_k$, on each step, $k$, using the value iteration update (4.10). If $0 \le \gamma < 1$, asymptotic convergence
to $v_*$ is guaranteed given only that all states occur in the sequence $\{s_k\}$ an infinite number of times
(the sequence could even be stochastic). (In the undiscounted episodic case, it is possible that there are 
some orderings of updates that do not result in convergence, but it is relatively easy to avoid these.) 
Similarly, it is possible to intermix policy evaluation and value iteration updates to produce a kind 
of asynchronous truncated policy iteration. Although the details of this and other more unusual DP 
algorithms are beyond the scope of this book, it is clear that a few different updates form building 
blocks that can be used flexibly in a wide variety of sweepless DP algorithms. 
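
As an illustration of the single-state version described above, here is a minimal sketch of asynchronous value iteration in which the state to update is drawn at random on every step. The two-state MDP, the dictionary layout `P[state][action]`, and the function name are hypothetical and chosen only to keep the example self-contained.

```python
import random

# Hypothetical two-state MDP used only for illustration:
# P[state][action] is a list of (probability, next_state, reward) triples.
TWO_STATE_MDP = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "back": [(1.0, 0, 0.0)]},
}

def async_value_iteration(P, gamma=0.9, n_updates=2000, seed=0):
    """Apply the value iteration update to one randomly chosen state per step."""
    rng = random.Random(seed)
    states = list(P)
    V = {s: 0.0 for s in states}
    for _ in range(n_updates):
        # No sweeps: any order works as long as every state keeps being selected.
        s = rng.choice(states)
        # In-place update using whatever values the other states currently have.
        V[s] = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
            for outcomes in P[s].values()
        )
    return V

V = async_value_iteration(TWO_STATE_MDP)
```

For this small MDP the optimal values can be worked out by hand: staying in state 1 forever is worth 2 / (1 - 0.9) = 20, and moving there from state 0 is worth 1 + 0.9 × 20 = 19.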

Of course, avoiding sweeps does not necessarily mean that we can get away with less computation. It 
just means that an algorithm does not need to get locked into any hopelessly long sweep before it can 
make progress improving a policy. We can try to take advantage of this flexibility by selecting the states 
to which we apply updates so as to improve the algorithm’s rate of progress. We can try to order the 
updates to let value information propagate from state to state in an efficient way. Some states may not 
need their values updated as often as others. We might even try to skip updating some states entirely 
if they are not relevant to optimal behavior. Some ideas for doing this are discussed in Chapter 8.

Asynchronous algorithms also make it easier to intermix computation with real-time interaction. To 
solve a given MDP, we can run an iterative DP algorithm at the same time that an agent is actually 
experiencing the MDP. The agent’s experience can be used to determine the states to which the DP 
algorithm applies its updates. At the same time, the latest value and policy information from the DP 
algorithm can guide the agent’s decision making. For example, we can apply updates to states as the 
agent visits them. This makes it possible to focus the DP algorithm’s updates onto parts of the state 
set that are most relevant to the agent. This kind of focusing is a repeated theme in reinforcement 
learning.


4.6 Generalized Policy Iteration

Policy iteration consists of two simultaneous, interacting processes, one making the value function 
consistent with the current policy (policy evaluation), and the other making the policy greedy with 
respect to the current value function (policy improvement). In policy iteration, these two processes 
alternate, each completing before the other begins, but this is not really necessary. In value iteration, for 
example, only a single iteration of policy evaluation is performed in between each policy improvement. 
In asynchronous DP methods, the evaluation and improvement processes are interleaved at an even 
finer grain. In some cases a single state is updated in one process before returning to the other. As long 
as both processes continue to update all states, the ultimate result is typically the same—convergence 
to the optimal value function and an optimal policy. 

We use the term generalized policy iteration (GPI) to refer to the general idea of letting policy
evaluation and policy improvement processes interact, independent of the granularity and other details
of the two processes. Almost all reinforcement learning methods are well described as GPI. That
is, all have identifiable policies and value functions, with the policy always being improved with respect
to the value function and the value function always being driven toward the value function for the policy,
as suggested by the diagram to the right. It is easy to see that if both the evaluation process and the
improvement process stabilize, that is, no longer produce changes, then the value function and policy
must be optimal. The value function stabilizes only when it is consistent with the current policy, and
the policy stabilizes only when it is greedy with respect to the current value function. Thus, both
processes stabilize only when a policy has been found that is greedy with respect to its own evaluation
function. This implies that the Bellman optimality equation (4.1) holds, and thus that the policy
and the value function are optimal.

The evaluation and improvement processes in GPI can be viewed as both 
competing and cooperating. They compete in the sense that they pull in opposing directions. Making 
the policy greedy with respect to the value function typically makes the value function incorrect for the 
changed policy, and making the value function consistent with the policy typically causes that policy 
no longer to be greedy. In the long run, however, these two processes interact to find a single joint 
solution: the optimal value function and an optimal policy. 
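
The interplay of the two processes can be sketched as a loop whose granularity is a parameter. In the illustrative code below, `eval_sweeps` controls how many in-place evaluation sweeps run before each greedy improvement: a large value behaves roughly like policy iteration and a value of 1 roughly like value iteration. The MDP, the data layout, and all names are assumptions made for this sketch.

```python
# Hypothetical two-state MDP used only for illustration:
# P[state][action] is a list of (probability, next_state, reward) triples.
TWO_STATE_MDP = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "back": [(1.0, 0, 0.0)]},
}

def generalized_policy_iteration(P, gamma=0.9, eval_sweeps=1, tol=1e-10,
                                 max_rounds=100_000):
    V = {s: 0.0 for s in P}
    pi = {s: next(iter(P[s])) for s in P}  # arbitrary initial policy

    def q(s, a):
        # One-step lookahead value of action a in state s under the current V.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    for _ in range(max_rounds):
        # Evaluation: drive V toward the value function of the current policy.
        delta = 0.0
        for _ in range(eval_sweeps):
            for s in P:
                v_new = q(s, pi[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
        # Improvement: make the policy greedy with respect to the current V.
        policy_stable = True
        for s in P:
            best = max(P[s], key=lambda a: q(s, a))
            if q(s, best) > q(s, pi[s]) + 1e-12:
                pi[s] = best
                policy_stable = False
        # Both processes have stabilized: pi is greedy for its own evaluation.
        if policy_stable and delta < tol:
            break
    return V, pi

V, pi = generalized_policy_iteration(TWO_STATE_MDP, eval_sweeps=1)
```

Running the same function with a larger `eval_sweeps` reaches the same policy and values on this example, which is the point of GPI: the granularity changes the path, not the destination.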

One might also think of the interaction between the evaluation and improvement processes in GPI
in terms of two constraints or goals, for example, as two lines in two-dimensional space as suggested
by the diagram to the right. Although the real geometry is much more complicated than this, the
diagram suggests what happens in the real case. Each process drives the value function or policy
toward one of the lines representing a solution to one of the two goals. The goals interact because
the two lines are not orthogonal. Driving directly toward one goal causes some movement away from
the other goal. Inevitably, however, the joint process is brought closer to the overall goal of optimality.
The arrows in this diagram correspond to the behavior of policy iteration in that each takes the system
all the way to achieving one of the two goals completely. In GPI one could also take smaller, incomplete
steps toward each goal. In either case, the two processes together achieve the overall goal of optimality
even though neither is attempting to achieve it directly.

[Diagram (not reproduced): arrows labeled “evaluation” and “improvement”.]


4.7 Efficiency of Dynamic Programming 

DP may not be practical for very large problems, but compared with other methods for solving MDPs, 
DP methods are actually quite efficient. If we ignore a few technical details, then the (worst case) time 
DP methods take to find an optimal policy is polynomial in the number of states and actions. If n and k 
denote the number of states and actions, this means that a DP method takes a number of computational 
operations that is less than some polynomial function of n and k. A DP method is guaranteed to find an 
optimal policy in polynomial time even though the total number of (deterministic) policies is $k^n$. In this
sense, DP is exponentially faster than any direct search in policy space could be, because direct search 
would have to exhaustively examine each policy to provide the same guarantee. Linear programming 
methods can also be used to solve MDPs, and in some cases their worst-case convergence guarantees 
are better than those of DP methods. But linear programming methods become impractical at a much 
smaller number of states than do DP methods (by a factor of about 100). For the largest problems, 
only DP methods are feasible. 

DP is sometimes thought to be of limited applicability because of the curse of dimensionality, the fact
that the number of states often grows exponentially with the number of state variables. Large state sets 
do create difficulties, but these are inherent difficulties of the problem, not of DP as a solution method. 
In fact, DP is comparatively better suited to handling large state spaces than competing methods such 
as direct search and linear programming. 

In practice, DP methods can be used with today’s computers to solve MDPs with millions of states. 
Both policy iteration and value iteration are widely used, and it is not clear which, if either, is better 
in general. In practice, these methods usually converge much faster than their theoretical worst-case 
run times, particularly if they are started with good initial value functions or policies. 

On problems with large state spaces, asynchronous DP methods are often preferred. To complete 
even one sweep of a synchronous method requires computation and memory for every state. For some 
problems, even this much memory and computation is impractical, yet the problem is still potentially 
solvable because relatively few states occur along optimal solution trajectories. Asynchronous methods 
and other variations of GPI can be applied in such cases and may find good or optimal policies much 
faster than synchronous methods can. 


4.8 Summary 

In this chapter we have become familiar with the basic ideas and algorithms of dynamic programming 
as they relate to solving finite MDPs. Policy evaluation refers to the (typically) iterative computation 
of the value functions for a given policy. Policy improvement refers to the computation of an improved 
policy given the value function for that policy. Putting these two computations together, we obtain 
policy iteration and value iteration, the two most popular DP methods. Either of these can be used to 
reliably compute optimal policies and value functions for finite MDPs given complete knowledge of the 
MDP. 

Classical DP methods operate in sweeps through the state set, performing an expected update operation
on each state. Each such operation updates the value of one state based on the values of all
possible successor states and their probabilities of occurring. Expected updates are closely related to
Bellman equations: they are little more than these equations turned into assignment statements. When
the updates no longer result in any changes in value, convergence has occurred to values that satisfy the
corresponding Bellman equation. Just as there are four primary value functions ($v_\pi$, $v_*$, $q_\pi$, and $q_*$),
there are four corresponding Bellman equations and four corresponding expected updates. An intuitive
view of the operation of DP updates is given by their backup diagrams.
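
To make the phrase “equations turned into assignment statements” concrete, the pair below shows the Bellman equation for $v_\pi$ next to the expected update of iterative policy evaluation. The notation ($\pi(a \mid s)$ for the policy, $p(s', r \mid s, a)$ for the four-argument dynamics function, $\gamma$ for the discount rate) is assumed to match the rest of the chapter.

```latex
% Bellman equation: a consistency condition that the true value function satisfies
v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma v_\pi(s') \bigr]

% Expected update: the same right-hand side, used as an assignment to the next estimate
v_{k+1}(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma v_k(s') \bigr]
```

The only change is that the unknown $v_\pi$ on the right is replaced by the current estimate $v_k$, and the equality becomes an assignment. When $v_{k+1} = v_k$ for every state, the estimate satisfies the Bellman equation.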

Insight into DP methods and, in fact, into almost all reinforcement learning methods, can be gained by 
viewing them as generalized policy iteration (GPI). GPI is the general idea of two interacting processes 
revolving around an approximate policy and an approximate value function. One process takes the 
policy as given and performs some form of policy evaluation, changing the value function to be more 
like the true value function for the policy. The other process takes the value function as given and 
performs some form of policy improvement, changing the policy to make it better, assuming that the 
value function is its value function. Although each process changes the basis for the other, overall they 
work together to find a joint solution: a policy and value function that are unchanged by either process 
and, consequently, are optimal. In some cases, GPI can be proved to converge, most notably for the 
classical DP methods that we have presented in this chapter. In other cases convergence has not been 
proved, but still the idea of GPI improves our understanding of the methods. 

It is not necessary to perform DP methods in complete sweeps through the state set. Asynchronous
DP methods are in-place iterative methods that update states in an arbitrary order, perhaps stochastically
determined and using out-of-date information. Many of these methods can be viewed as fine-grained
forms of GPI.

Finally, we note one last special property of DP methods. All of them update estimates of the values 
of states based on estimates of the values of successor states. That is, they update estimates on the 
basis of other estimates. We call this general idea bootstrapping. Many reinforcement learning methods 
perform bootstrapping, even those that do not require, as DP requires, a complete and accurate model 
of the environment. In the next chapter we explore reinforcement learning methods that do not require a 
model and do not bootstrap. In the chapter after that we explore methods that do not require a model 
but do bootstrap. These key features and properties are separable, yet can be mixed in interesting 
combinations. 


Bibliographical and Historical Remarks 

The term “dynamic programming” is due to Bellman (1957a), who showed how these methods could be 
applied to a wide range of problems. Extensive treatments of DP can be found in many texts, including 
Bertsekas (2005, 2012), Bertsekas and Tsitsiklis (1996), Dreyfus and Law (1977), Ross (1983), White 
(1969), and Whittle (1982, 1983). Our interest in DP is restricted to its use in solving MDPs, but DP 
also applies to other types of problems. Kumar and Kanal (1988) provide a more general look at DP. 

To the best of our knowledge, the first connection between DP and reinforcement learning was made 
by Minsky (1961) in commenting on Samuel’s checkers player. In a footnote, Minsky mentioned that 
it is possible to apply DP to problems in which Samuel’s backing-up process can be handled in closed 
analytic form. This remark may have misled artificial intelligence researchers into believing that DP
was restricted to analytically tractable problems and therefore largely irrelevant to artificial intelligence.
Andreae (1969b) mentioned DP in the context of reinforcement learning, specifically policy
iteration, although he did not make specific connections between DP and learning algorithms. Werbos
(1977) suggested an approach to approximating DP called “heuristic dynamic programming” that
emphasizes gradient-descent methods for continuous-state problems (Werbos, 1982, 1987, 1988, 1989,
1992). These methods are closely related to the reinforcement learning algorithms that we discuss in 
this book. Watkins (1989) was explicit in connecting reinforcement learning to DP, characterizing a 
class of reinforcement learning methods as “incremental dynamic programming.” 

4.1–4 These sections describe well-established DP algorithms that are covered in any of the general
DP references cited above. The policy improvement theorem and the policy iteration algorithm 
are due to Bellman (1957a) and Howard (1960). Our presentation was influenced by the local 
view of policy improvement taken by Watkins (1989). Our discussion of value iteration as a
form of truncated policy iteration is based on the approach of Puterman and Shin (1978), who
presented a class of algorithms called modified policy iteration, which includes policy iteration 
and value iteration as special cases. An analysis showing how value iteration can be made to 
find an optimal policy in finite time is given by Bertsekas (1987). 

Iterative policy evaluation is an example of a classical successive approximation algorithm for 
solving a system of linear equations. The version of the algorithm that uses two arrays, one 
holding the old values while the other is updated, is often called a Jacobi-style algorithm, 
after Jacobi’s classical use of this method. It is also sometimes called a synchronous algorithm 
because the effect is as if all the values are updated at the same time. The second array is needed 
to simulate this parallel computation sequentially. The in-place version of the algorithm is often 
called a Gauss-Seidel-style algorithm after the classical Gauss-Seidel algorithm for solving 
systems of linear equations. In addition to iterative policy evaluation, other DP algorithms can 
be implemented in these different versions. Bertsekas and Tsitsiklis (1989) provide excellent 
coverage of these variations and their performance differences. 

4.5 Asynchronous DP algorithms are due to Bertsekas (1982, 1983), who also called them distributed
DP algorithms. The original motivation for asynchronous DP was its implementation
on a multiprocessor system with communication delays between processors and no global synchronizing
clock. These algorithms are extensively discussed by Bertsekas and Tsitsiklis (1989).
Jacobi-style and Gauss-Seidel-style DP algorithms are special cases of the asynchronous version.
Williams and Baird (1990) presented DP algorithms that are asynchronous at a finer
grain than the ones we have discussed: the update operations themselves are broken into steps
that can be performed asynchronously.

4.7 This section, written with the help of Michael Littman, is based on Littman, Dean, and Kaelbling
(1995). The phrase “curse of dimensionality” is due to Bellman (1957).


Chapter 5


Monte Carlo Methods 


In this chapter we consider our first learning methods for estimating value functions and discovering 
optimal policies. Unlike the previous chapter, here we do not assume complete knowledge of the 
environment. Monte Carlo methods require only experience: sample sequences of states, actions, and
rewards from actual or simulated interaction with an environment. Learning from actual experience 
is striking because it requires no prior knowledge of the environment’s dynamics, yet can still attain 
optimal behavior. Learning from simulated experience is also powerful. Although a model is required, 
the model need only generate sample transitions, not the complete probability distributions of all 
possible transitions that is required for dynamic programming (DP). In surprisingly many cases it is 
easy to generate experience sampled according to the desired probability distributions, but infeasible 
to obtain the distributions in explicit form. 

Monte Carlo methods are ways of solving the reinforcement learning problem based on averaging 
sample returns. To ensure that well-defined returns are available, here we define Monte Carlo methods 
only for episodic tasks. That is, we assume experience is divided into episodes, and that all episodes 
eventually terminate no matter what actions are selected. Only on the completion of an episode are 
value estimates and policies changed. Monte Carlo methods can thus be incremental in an episode-by-episode
sense, but not in a step-by-step (online) sense. The term “Monte Carlo” is often used more
broadly for any estimation method whose operation involves a significant random component. Here we 
use it specifically for methods based on averaging complete returns (as opposed to methods that learn 
from partial returns, considered in the next chapter). 

Monte Carlo methods sample and average returns for each state-action pair much like the bandit 
methods we explored in Chapter 2 sample and average rewards for each action. The main difference is 
that now there are multiple states, each acting like a different bandit problem (like an associative-search 
or contextual bandit) and the different bandit problems are interrelated. That is, the return after taking 
an action in one state depends on the actions taken in later states in the same episode. Because all the 
action selections are undergoing learning, the problem becomes nonstationary from the point of view 
of the earlier state. 

To handle the nonstationarity, we adapt the idea of generalized policy iteration (GPI) developed in
Chapter 4 for DP. Whereas there we computed value functions from knowledge of the MDP, here
we learn value functions from sample returns with the MDP. The value functions and corresponding
policies still interact to attain optimality in essentially the same way (GPI). As in the DP chapter, first
we consider the prediction problem (the computation of $v_\pi$ and $q_\pi$ for a fixed arbitrary policy $\pi$), then
policy improvement, and, finally, the control problem and its solution by GPI. Each of these ideas taken
from DP is extended to the Monte Carlo case in which only sample experience is available.


5.1 Monte Carlo Prediction

We begin by considering Monte Carlo methods for learning the state-value function for a given policy. 
Recall that the value of a state is the expected return—expected cumulative future discounted reward— 
starting from that state. An obvious way to estimate it from experience, then, is simply to average the 
returns observed after visits to that state. As more returns are observed, the average should converge 
to the expected value. This idea underlies all Monte Carlo methods. 

In particular, suppose we wish to estimate $v_\pi(s)$, the value of a state $s$ under policy $\pi$, given a set
of episodes obtained by following $\pi$ and passing through $s$. Each occurrence of state $s$ in an episode
is called a visit to $s$. Of course, $s$ may be visited multiple times in the same episode; let us call the
first time it is visited in an episode the first visit to $s$. The first-visit MC method estimates $v_\pi(s)$
as the average of the returns following first visits to s, whereas the every-visit MC method averages 
the returns following all visits to s. These two Monte Carlo (MC) methods are very similar but have 
slightly different theoretical properties. First-visit MC has been most widely studied, dating back to the 
1940s, and is the one we focus on in this chapter. Every-visit MC extends more naturally to function 
approximation and eligibility traces, as discussed in Chapters 9 and 12. First-visit MC is shown in 
procedural form in the box. 


First-visit MC prediction, for estimating $V \approx v_\pi$

Initialize:
    $\pi \leftarrow$ policy to be evaluated
    $V \leftarrow$ an arbitrary state-value function
    $Returns(s) \leftarrow$ an empty list, for all $s \in \mathcal{S}$

Repeat forever:
    Generate an episode using $\pi$
    For each state $s$ appearing in the episode:
        $G \leftarrow$ the return that follows the first occurrence of $s$
        Append $G$ to $Returns(s)$
        $V(s) \leftarrow \text{average}(Returns(s))$


Both first-visit MC and every-visit MC converge to $v_\pi(s)$ as the number of visits (or first visits)
to $s$ goes to infinity. This is easy to see for the case of first-visit MC. In this case each return is an
independent, identically distributed estimate of $v_\pi(s)$ with finite variance. By the law of large numbers
the sequence of averages of these estimates converges to their expected value. Each average is itself
an unbiased estimate, and the standard deviation of its error falls as $1/\sqrt{n}$, where $n$ is the number of
returns averaged. Every-visit MC is less straightforward, but its estimates also converge quadratically
to $v_\pi(s)$ (Singh and Sutton, 1996).
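
The first-visit procedure translates almost line for line into code. The sketch below is one possible rendering, not a reference implementation: it keeps a running sum and count per state instead of a list of returns (which gives the same average), and it assumes that an episode is supplied as a list of pairs $(S_t, R_{t+1})$. The five-state random walk used to exercise it is a hypothetical test environment in which the true value of state $i$ under the equiprobable policy is $i/6$.

```python
import random
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, n_episodes, gamma=1.0):
    """Estimate state values by averaging returns that follow first visits."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(n_episodes):
        episode = generate_episode()  # list of (state S_t, reward R_{t+1})
        # Time step at which each state first occurs in this episode.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # Work backward so that G is the return following time t.
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = gamma * G + r
            if first_visit[s] == t:
                returns_sum[s] += G
                returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

def make_random_walk(n_states=5, seed=0):
    """Hypothetical episodic task: states 1..n_states, terminals at both ends,
    reward +1 only when the walk exits on the right."""
    rng = random.Random(seed)
    def generate_episode():
        s = (n_states + 1) // 2  # start in the middle
        episode = []
        while 1 <= s <= n_states:
            s_next = s + rng.choice((-1, 1))  # equiprobable left or right
            reward = 1.0 if s_next == n_states + 1 else 0.0
            episode.append((s, reward))
            s = s_next
        return episode
    return generate_episode

V = first_visit_mc_prediction(make_random_walk(), n_episodes=20_000)
```

Changing the condition `first_visit[s] == t` to always be true would turn this into the every-visit method.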

The use of Monte Carlo methods is best illustrated through an example. 

Example 5.1: Blackjack The object of the popular casino card game of blackjack is to obtain cards 
the sum of whose numerical values is as great as possible without exceeding 21. All face cards count as 
10, and an ace can count either as 1 or as 11. We consider the version in which each player competes 
independently against the dealer. The game begins with two cards dealt to both dealer and player. One 
of the dealer’s cards is face up and the other is face down. If the player has 21 immediately (an ace and 
a 10-card), it is called a natural. He then wins unless the dealer also has a natural, in which case the 
game is a draw. If the player does not have a natural, then he can request additional cards, one by one 
(hits), until he either stops (sticks) or exceeds 21 (goes bust). If he goes bust, he loses; if he sticks, then
it becomes the dealer’s turn. The dealer hits or sticks according to a fixed strategy without choice: he 
sticks on any sum of 17 or greater, and hits otherwise. If the dealer goes bust, then the player wins; 
otherwise, the outcome (win, lose, or draw) is determined by whose final sum is closer to 21.

Playing blackjack is naturally formulated as an episodic finite MDP. Each game of blackjack is an
episode. Rewards of +1, -1, and 0 are given for winning, losing, and drawing, respectively. All rewards
within a game are zero, and we do not discount ($\gamma = 1$); therefore these terminal rewards are also the
returns. The player’s actions are to hit or to stick. The states depend on the player’s cards and the 
dealer’s showing card. We assume that cards are dealt from an infinite deck (i.e., with replacement) so 
that there is no advantage to keeping track of the cards already dealt. If the player holds an ace that he 
could count as 11 without going bust, then the ace is said to be usable. In this case it is always counted 
as 11 because counting it as 1 would make the sum 11 or less, in which case there is no decision to be 
made because, obviously, the player should always hit. Thus, the player makes decisions on the basis 
of three variables: his current sum (12–21), the dealer’s one showing card (ace–10), and whether or not
he holds a usable ace. This makes for a total of 200 states. 

Consider the policy that sticks if the player’s sum is 20 or 21, and otherwise hits. To find the state- 
value function for this policy by a Monte Carlo approach, one simulates many blackjack games using 
the policy and averages the returns following each state. Note that in this task the same state never 
recurs within one episode, so there is no difference between first-visit and every-visit MC methods. In 
this way, we obtained the estimates of the state-value function shown in Figure 5.1. The estimates for 
states with a usable ace are less certain and less regular because these states are less common. In any 
event, after 500,000 games the value function is very well approximated. 

Although we have complete knowledge of the environment in this task, it would not be easy to apply 
DP methods to compute the value function. DP methods require the distribution of next events (in
particular, they require the environment’s dynamics as given by the four-argument function $p$), and
it is not easy to determine this for blackjack. For example, suppose the player’s sum is 14 and he
chooses to stick. What is his probability of terminating with a reward of +1 as a function of the
dealer’s showing card? All of the probabilities must be computed before DP can be applied, and such 
computations are often complex and error-prone. In contrast, generating the sample games required by 
Monte Carlo methods is easy. This is the case surprisingly often; the ability of Monte Carlo methods to 
work with sample episodes alone can be a significant advantage even when one has complete knowledge 
of the environment’s dynamics. 
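
To illustrate how little is needed to generate sample games, here is a minimal simulator for the version of blackjack described in Example 5.1, together with Monte Carlo evaluation of the policy that sticks only on 20 or 21. It is a sketch under stated assumptions rather than the code behind Figure 5.1: the function names and the state encoding `(player_sum, dealer_showing, usable_ace)` are choices made here, and the seed and episode count are arbitrary. Because no state recurs within a game and all intermediate rewards are zero with $\gamma = 1$, the return following every visited state is simply the final reward.

```python
import random
from collections import defaultdict

def draw_card(rng):
    # Infinite deck: ace = 1, cards 2-9 at face value, and 10, J, Q, K all count as 10.
    return min(rng.randint(1, 13), 10)

def hand_value(cards):
    """Return (sum, usable_ace); one ace counts as 11 when that does not bust."""
    total = sum(cards)
    if 1 in cards and total + 10 <= 21:
        return total + 10, True
    return total, False

def play_episode(policy, rng):
    """Play one game. Return (states visited by the player, final reward)."""
    player = [draw_card(rng), draw_card(rng)]
    dealer = [draw_card(rng), draw_card(rng)]
    showing = dealer[0]
    states = []
    while True:  # player's turn
        total, usable = hand_value(player)
        if total > 21:
            return states, -1.0  # player goes bust
        if total >= 12:
            state = (total, showing, usable)
            states.append(state)
            if policy(state) == "stick":
                break
        player.append(draw_card(rng))  # below 12 the player always hits
    player_total = hand_value(player)[0]
    if len(player) == 2 and player_total == 21:  # natural
        dealer_natural = hand_value(dealer)[0] == 21
        return states, (0.0 if dealer_natural else 1.0)
    while hand_value(dealer)[0] < 17:  # dealer's fixed strategy
        dealer.append(draw_card(rng))
    dealer_total = hand_value(dealer)[0]
    if dealer_total > 21 or player_total > dealer_total:
        return states, 1.0
    return states, (0.0 if player_total == dealer_total else -1.0)

def stick_on_20_or_21(state):
    return "stick" if state[0] >= 20 else "hit"

def evaluate_policy(policy, n_episodes, seed=0):
    """Monte Carlo policy evaluation: average the returns following each state."""
    rng = random.Random(seed)
    total, count = defaultdict(float), defaultdict(int)
    for _ in range(n_episodes):
        states, reward = play_episode(policy, rng)
        for s in states:
            total[s] += reward
            count[s] += 1
    return {s: total[s] / count[s] for s in total}

V = evaluate_policy(stick_on_20_or_21, n_episodes=50_000)
```

Nowhere does the code compute a transition probability; it only samples cards, which is exactly the advantage described above.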


[Figure 5.1 (image not reproduced). Panel titles: After 10,000 episodes; After 500,000 episodes. Row label: Usable ace.]

Figure 5.1: Approximate state-value functions for the blackjack policy that sticks only on 20 or 21, computed
by Monte Carlo policy evaluation. ■

Can we generalize the idea of backup diagrams to Monte Carlo algorithms? The general idea of
a backup diagram is to show at the top the root node to be updated and to show below all the
transitions and leaf nodes whose rewards and estimated values contribute to the update. For Monte
Carlo estimation of $v_\pi$, the root is a state node, and below it is the entire trajectory of transitions
along a particular single episode, ending at the terminal state, as shown to the right. Whereas the DP 
diagram (page 47) shows all possible transitions, the Monte Carlo diagram shows only those sampled on 
the one episode. Whereas the DP diagram includes only one-step transitions, the Monte Carlo diagram 
goes all the way to the end of the episode. These differences in the diagrams accurately reflect the 
fundamental differences between the algorithms. 

An important fact about Monte Carlo methods is that the estimates for each state are independent.
The estimate for one state does not build upon the estimate of any other state, as is
the case in DP. In other words, Monte Carlo methods do not bootstrap as we defined it in the 
previous chapter. 

In particular, note that the computational expense of estimating the value of a single state is 
independent of the number of states. This can make Monte Carlo methods particularly attractive 
when one requires the value of only one or a subset of states. One can generate many sample 
episodes starting from the states of interest, averaging returns from only these states, ignoring 
all others. This is a third advantage Monte Carlo methods can have over DP methods (after 
the ability to learn from actual experience and from simulated experience). 

Example 5.2: Soap Bubble 

Suppose a wire frame forming a closed loop is dunked in soapy water to form a soap surface or 
bubble conforming at its edges to the wire frame. If the geometry of the wire frame is irregular but 
known, how can you compute the shape of the surface? The shape has the property that the total force 
on each point exerted by neighboring points is zero (or else the shape would change). This means that 
the surface’s height at any point is the average of its heights at points in a small circle around that 
point. In addition, the surface must meet at its boundaries with the wire frame. The usual approach 
to problems of this kind is to put a grid over the area covered by the surface and solve for its height 
at the grid points by an iterative computation. Grid points at the boundary are forced to the wire 
frame, and all others are adjusted toward the average of the heights of their four nearest neighbors. 
This process then iterates, much like DP’s iterative policy evaluation, and ultimately converges to a 
close approximation to the desired surface. 

This is similar to the kind of problem for which Monte Carlo methods were originally designed. Instead
of the iterative computation described above, imagine standing on the surface and taking a random walk,
stepping randomly from grid point to neighboring grid point, with equal probability, until you reach the
boundary. It turns out that the expected value of the height at the boundary is a close approximation to
the height of the desired surface at the starting point (in fact, it is exactly the value computed by the
iterative method described above). Thus, one can closely approximate the height of the surface at a point
by simply averaging the boundary heights of many walks started at the point. If one is interested in only
the value at one point, or any fixed small set of points, then this Monte Carlo method can be far more
efficient than the iterative method based on local consistency. ■
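
The random-walk estimate described in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the square grid, the names `boundary_height` and `mc_height`, and the linearly rising wire frame are hypothetical choices made for the sketch, not part of the original example.

```python
import random

# The four grid moves, each chosen with equal probability.
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def boundary_height(x, y, n):
    """Hypothetical wire frame on an n-by-n grid: height rises linearly with x."""
    return float(x)


def mc_height(x0, y0, n, walks, rng):
    """Estimate the surface height at (x0, y0) by averaging the boundary
    heights reached by `walks` independent random walks."""
    total = 0.0
    for _ in range(walks):
        x, y = x0, y0
        # Interior grid points satisfy 0 < x < n and 0 < y < n.
        while 0 < x < n and 0 < y < n:
            dx, dy = rng.choice(MOVES)
            x, y = x + dx, y + dy
        total += boundary_height(x, y, n)
    return total / walks
```

For this particular frame the height h(x, y) = x already equals the average of its four neighbors at every interior point, so a call such as `mc_height(3, 5, 10, 10_000, random.Random(0))` should return a value close to 3. Only walks from the one point of interest are simulated; no other grid point is ever estimated.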

Exercise 5.1 Consider the diagrams on the right in Figure 5.1. Why does the estimated value function
jump up for the last two rows in the rear? Why does it drop off for the whole last row on the left? Why 
are the frontmost values higher in the upper diagrams than in the lower? □ 


5.2 Monte Carlo Estimation of Action Values 

If a model is not available, then it is particularly useful to estimate action values (the values of state- 
action pairs) rather than state values. With a model, state values alone are sufficient to determine a 
policy; one simply looks ahead one step and chooses whichever action leads to the best combination of 
reward and next state, as we did in the chapter on DP. Without a model, however, state values alone 
are not sufficient. One must explicitly estimate the value of each action in order for the values to be 
useful in suggesting a policy. Thus, one of our primary goals for Monte Carlo methods is to estimate 
$q_*$. To achieve this, we first consider the policy evaluation problem for action values.

The policy evaluation problem for action values is to estimate $q_\pi(s,a)$, the expected return when
starting in state $s$, taking action $a$, and thereafter following policy $\pi$. The Monte Carlo methods for
this are essentially the same as just presented for state values, except now we talk about visits to a 
state-action pair rather than to a state. A state-action pair s,a is said to be visited in an episode if 
ever the state s is visited and action a is taken in it. The every-visit MC method estimates the value 
of a state-action pair as the average of the returns that have followed all the visits to it. The first-visit 
MC method averages the returns following the first time in each episode that the state was visited and 
the action was selected. These methods converge quadratically, as before, to the true expected values 
as the number of visits to each state-action pair approaches infinity. 
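
A minimal sketch of first-visit MC estimation of action values, assuming each episode is supplied as a list of `(state, action, reward)` triples in which the reward is the one received after taking the action. The function name `first_visit_mc_q` and this episode format are assumptions made for illustration.

```python
from collections import defaultdict


def first_visit_mc_q(episodes, gamma=1.0):
    """Average, for each state-action pair, the returns that followed its
    first occurrence in each episode."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        first_return = {}
        # Walk backward so g is the return following each time step.
        # Earlier occurrences overwrite later ones, so the value left in
        # the dict is the return after the FIRST visit to the pair.
        for state, action, reward in reversed(episode):
            g = reward + gamma * g
            first_return[(state, action)] = g
        for pair, g_first in first_return.items():
            returns[pair].append(g_first)
    return {pair: sum(gs) / len(gs) for pair, gs in returns.items()}
```

Pairs that never appear in any episode are simply absent from the result, which is the complication taken up next.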

The only complication is that many state-action pairs may never be visited. If $\pi$ is a deterministic
policy, then in following $\pi$ one will observe returns only for one of the actions from each state. With
no returns to average, the Monte Carlo estimates of the other actions will not improve with experience. 
This is a serious problem because the purpose of learning action values is to help in choosing among 
the actions available in each state. To compare alternatives we need to estimate the value of all the 
actions from each state, not just the one we currently favor. 

This is the general problem of maintaining exploration, as discussed in the context of the $k$-armed
bandit problem in Chapter 2. For policy evaluation to work for action values, we must assure continual
exploration. One way to do this is by specifying that the episodes start in a state-action pair, and that
every pair has a nonzero probability of being selected as the start. This guarantees that all state-action 
pairs will be visited an infinite number of times in the limit of an infinite number of episodes. We call 
this the assumption of exploring starts. 

The assumption of exploring starts is sometimes useful, but of course it cannot be relied upon in 
general, particularly when learning directly from actual interaction with an environment. In that case 
the starting conditions are unlikely to be so helpful. The most common alternative approach to assuring 
that all state-action pairs are encountered is to consider only policies that are stochastic with a nonzero 
probability of selecting all actions in each state. We discuss two important variants of this approach in 
later sections. For now, we retain the assumption of exploring starts and complete the presentation of 
a full Monte Carlo control method. 

Exercise 5.2 What is the backup diagram for Monte Carlo estimation of $q_\pi$? □


5.3 Monte Carlo Control


We are now ready to consider how Monte Carlo estimation can be used in control, that is, to approximate
optimal policies. The overall idea is to proceed according to the same pattern as in the DP chapter, that
is, according to the idea of generalized policy iteration (GPI). In GPI one maintains both an approximate
policy and an approximate value function. The value function is repeatedly altered to more closely
approximate the value function for the current policy, and the policy is repeatedly improved with respect
to the current value function, as suggested by the diagram to the right. These two kinds of changes work
against each other to some extent, as each creates a moving target for the other, but together they cause
both policy and value function to approach optimality.

To begin, let us consider a Monte Carlo version of classical policy iteration. In this method, we 
perform alternating complete steps of policy evaluation and policy improvement, beginning with an 
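arbitrary policy $\pi_0$ and ending with the optimal policy and optimal action-value function:

$$\pi_0 \xrightarrow{E} q_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} q_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} q_*,$$

where $\xrightarrow{E}$ denotes a complete policy evaluation and $\xrightarrow{I}$ denotes a complete policy improvement.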
Policy evaluation is done exactly as described in the preceding section. Many episodes are experienced, 
with the approximate action-value function approaching the true function asymptotically. For the 
moment, let us assume that we do indeed observe an infinite number of episodes and that, in addition, 
the episodes are generated with exploring starts. Under these assumptions, the Monte Carlo methods 
will compute each $q_{\pi_k}$ exactly, for arbitrary $\pi_k$.

Policy improvement is done by making the policy greedy with respect to the current value function. 
In this case we have an action-value function, and therefore no model is needed to construct the greedy 
policy. For any action-value function $q$, the corresponding greedy policy is the one that, for each $s \in \mathcal{S}$,
deterministically chooses an action with maximal action-value:

$$\pi(s) = \arg\max_a q(s,a). \tag{5.1}$$

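[Diagram: the GPI cycle between $\pi$ and $Q$, with arrows labeled "evaluation" and "improvement" ($\pi$ made greedy with respect to $Q$).]

Policy improvement then can be done by constructing each $\pi_{k+1}$ as the greedy policy with respect to
$q_{\pi_k}$. The policy improvement theorem (Section 4.2) then applies to $\pi_k$ and $\pi_{k+1}$ because, for all $s \in \mathcal{S}$,

$$
\begin{aligned}
q_{\pi_k}(s, \pi_{k+1}(s)) &= q_{\pi_k}\big(s, \arg\max_a q_{\pi_k}(s,a)\big) \\
&= \max_a q_{\pi_k}(s,a) \\
&\ge q_{\pi_k}(s, \pi_k(s)) \\
&\ge v_{\pi_k}(s).
\end{aligned}
$$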


As we discussed in the previous chapter, the theorem assures us that each $\pi_{k+1}$ is uniformly better than
$\pi_k$, or just as good as $\pi_k$, in which case they are both optimal policies. This in turn assures us that the
overall process converges to the optimal policy and optimal value function. In this way Monte Carlo 
methods can be used to find optimal policies given only sample episodes and no other knowledge of the 
environment’s dynamics. 

We made two unlikely assumptions above in order to easily obtain this guarantee of convergence for 
the Monte Carlo method. One was that the episodes have exploring starts, and the other was that 
policy evaluation could be done with an infinite number of episodes. To obtain a practical algorithm 
we will have to remove both assumptions. We postpone consideration of the first assumption until later 
in this chapter. 
For now we focus on the assumption that policy evaluation operates on an infinite number of episodes.
This assumption is relatively easy to remove. In fact, the same issue arises even in classical DP methods 
such as iterative policy evaluation, which also converge only asymptotically to the true value function. 
In both DP and Monte Carlo cases there are two ways to solve the problem. One is to hold firm to 
the idea of approximating $q_{\pi_k}$ in each policy evaluation. Measurements and assumptions are made to
obtain bounds on the magnitude and probability of error in the estimates, and then sufficient steps are 
taken during each policy evaluation to assure that these bounds are sufficiently small. This approach 
can probably be made completely satisfactory in the sense of guaranteeing correct convergence up to 
some level of approximation. However, it is also likely to require far too many episodes to be useful in 
practice on any but the smallest problems. 

There is a second approach to avoiding the infinite number of episodes nominally required for policy 
evaluation, in which we give up trying to complete policy evaluation before returning to policy
improvement. On each evaluation step we move the value function toward $q_{\pi_k}$, but we do not expect to
actually get close except over many steps. We used this idea when we first introduced the idea of GPI 
in Section 4.6. One extreme form of the idea is value iteration, in which only one iteration of iterative 
policy evaluation is performed between each step of policy improvement. The in-place version of value 
iteration is even more extreme; there we alternate between improvement and evaluation steps for single 
states. 

For Monte Carlo policy evaluation it is natural to alternate between evaluation and improvement on 
an episode-by-episode basis. After each episode, the observed returns are used for policy evaluation, 
and then the policy is improved at all the states visited in the episode. A complete simple algorithm 
along these lines, which we call Monte Carlo ES, for Monte Carlo with Exploring Starts, is given in the
box. 


Monte Carlo ES (Exploring Starts), for estimating $\pi \approx \pi_*$

Initialize, for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$:
    $Q(s, a) \leftarrow$ arbitrary
    $\pi(s) \leftarrow$ arbitrary
    $Returns(s, a) \leftarrow$ empty list

Repeat forever:
    Choose $S_0 \in \mathcal{S}$ and $A_0 \in \mathcal{A}(S_0)$ s.t. all pairs have probability $> 0$
    Generate an episode starting from $S_0, A_0$, following $\pi$
    For each pair $s, a$ appearing in the episode:
        $G \leftarrow$ the return that follows the first occurrence of $s, a$
        Append $G$ to $Returns(s, a)$
        $Q(s, a) \leftarrow \mathrm{average}(Returns(s, a))$
    For each $s$ in the episode:
        $\pi(s) \leftarrow \arg\max_a Q(s, a)$
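
The box above can be sketched in Python roughly as follows. The two-state episodic task defined by `step` is a hypothetical toy problem invented for this illustration (every episode ends within two steps, so any deterministic policy terminates); the function and variable names are likewise assumptions.

```python
import random
from collections import defaultdict

STATES = (0, 1)
ACTIONS = (0, 1)


def step(state, action):
    """Hypothetical toy dynamics: returns (reward, next_state); None is terminal."""
    if state == 0:
        return (0.0, 1) if action == 0 else (1.0, None)
    return (0.0, None) if action == 0 else (5.0, None)


def monte_carlo_es(num_episodes, rng, gamma=1.0):
    q = defaultdict(float)           # Q(s, a), arbitrary initial value 0
    returns = defaultdict(list)      # Returns(s, a)
    policy = {s: 0 for s in STATES}  # arbitrary initial deterministic policy
    for _ in range(num_episodes):
        # Exploring start: every state-action pair has nonzero probability.
        state, action = rng.choice(STATES), rng.choice(ACTIONS)
        episode = []
        while state is not None:
            reward, next_state = step(state, action)
            episode.append((state, action, reward))
            state = next_state
            if state is not None:
                action = policy[state]
        # Return following the first occurrence of each pair.
        g, first_return = 0.0, {}
        for s, a, r in reversed(episode):
            g = r + gamma * g
            first_return[(s, a)] = g
        for pair, g_first in first_return.items():
            returns[pair].append(g_first)
            q[pair] = sum(returns[pair]) / len(returns[pair])
        # Greedy improvement at every state visited in the episode.
        for s in {s for s, _, _ in episode}:
            policy[s] = max(ACTIONS, key=lambda a: q[(s, a)])
    return policy, dict(q)
```

In this toy task the best behavior is action 0 in state 0 followed by action 1 in state 1, and a call such as `monte_carlo_es(2000, random.Random(0))` should settle on that policy. As in the box, returns are averaged regardless of which policy was in force when they were observed.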


In Monte Carlo ES, all the returns for each state-action pair are accumulated and averaged,
irrespective of what policy was in force when they were observed. It is easy to see that Monte Carlo ES
cannot converge to any suboptimal policy. If it did, then the value function would eventually converge 
to the value function for that policy, and that in turn would cause the policy to change. Stability is 
achieved only when both the policy and the value function are optimal. Convergence to this optimal 
fixed point seems inevitable as the changes to the action-value function decrease over time, but has not 
yet been formally proved. In our opinion, this is one of the most fundamental open theoretical questions 
in reinforcement learning (for a partial solution, see Tsitsiklis, 2002). 

Example 5.3: Solving Blackjack It is straightforward to apply Monte Carlo ES to blackjack.
Since the episodes are all simulated games, it is easy to arrange for exploring starts that include all 
possibilities. In this case one simply picks the dealer’s cards, the player’s sum, and whether or not the 
player has a usable ace, all at random with equal probability. As the initial policy we use the policy 
evaluated in the previous blackjack example, that which sticks only on 20 or 21. The initial action-value 
function can be zero for all state-action pairs. Figure 5.2 shows the optimal policy for blackjack found 
by Monte Carlo ES. This policy is the same as the “basic” strategy of Thorp (1966) with the sole 
exception of the leftmost notch in the policy for a usable ace, which is not present in Thorp’s strategy. 
We are uncertain of the reason for this discrepancy, but confident that what is shown here is indeed the 
optimal policy for the version of blackjack we have described. 


[Figure 5.2 panels: the optimal policy $\pi_*$ and the state-value function $v_*$, each shown for the usable-ace and no-usable-ace cases.]

Figure 5.2: The optimal policy and state-value function for blackjack, found by Monte Carlo ES (Figure
5.4). The state-value function shown was computed from the action-value function found by Monte Carlo ES.


5.4 Monte Carlo Control without Exploring Starts 

How can we avoid the unlikely assumption of exploring starts? The only general way to ensure that all 
actions are selected infinitely often is for the agent to continue to select them. There are two approaches 
to ensuring this, resulting in what we call on-policy methods and off-policy methods. On-policy methods 
attempt to evaluate or improve the policy that is used to make decisions, whereas off-policy methods 
evaluate or improve a policy different from that used to generate the data. The Monte Carlo ES method 
developed above is an example of an on-policy method. In this section we show how an on-policy Monte 
Carlo control method can be designed that does not use the unrealistic assumption of exploring starts. 
Off-policy methods are considered in the next section. 

In on-policy control methods the policy is generally soft, meaning that $\pi(a|s) > 0$ for all $s \in \mathcal{S}$ and
all $a \in \mathcal{A}(s)$, but gradually shifted closer and closer to a deterministic optimal policy. Many of the
methods discussed in Chapter 2 provide mechanisms for this. The on-policy method we present in this
section uses $\varepsilon$-greedy policies, meaning that most of the time they choose an action that has maximal
estimated action value, but with probability $\varepsilon$ they instead select an action at random. That is, all
nongreedy actions are given the minimal probability of selection, $\frac{\varepsilon}{|\mathcal{A}(s)|}$, and the remaining bulk of the
probability, $1 - \varepsilon + \frac{\varepsilon}{|\mathcal{A}(s)|}$, is given to the greedy action. The $\varepsilon$-greedy policies are examples of $\varepsilon$-soft
policies, defined as policies for which $\pi(a|s) \ge \frac{\varepsilon}{|\mathcal{A}(s)|}$ for all states and actions, for some $\varepsilon > 0$. Among
$\varepsilon$-soft policies, $\varepsilon$-greedy policies are in some sense those that are closest to greedy.
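
As a small illustration of these probabilities, the following sketch builds the $\varepsilon$-greedy distribution for one state from a list of action-value estimates. The function name `epsilon_greedy_probs` and the list-of-floats representation are assumptions made for the example; ties are broken in favor of the lowest-indexed action.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Return the epsilon-greedy action probabilities for one state.

    Every action receives epsilon / n; the greedy action additionally
    receives the remaining 1 - epsilon.
    """
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs
```

With four actions and `epsilon = 0.2`, each nongreedy action gets probability 0.05 and the greedy action gets 0.85, so the resulting policy is $\varepsilon$-soft by construction.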

The overall idea of on-policy Monte Carlo control is still that of GPI. As in Monte Carlo ES, we
use first-visit MC methods to estimate the action-value function for the current policy. Without the
assumption of exploring starts, however, we cannot simply improve the policy by making it greedy
with respect to the current value function, because that would prevent further exploration of nongreedy
actions. Fortunately, GPI does not require that the policy be taken all the way to a greedy policy, only
that it be moved toward a greedy policy. In our on-policy method we will move it only to an $\varepsilon$-greedy
policy. For any $\varepsilon$-soft policy, $\pi$, any $\varepsilon$-greedy policy with respect to $q_\pi$ is guaranteed to be better than
or equal to $\pi$. The complete algorithm is given in the box below.


On-policy first-visit MC control (for $\varepsilon$-soft policies), estimates $\pi \approx \pi_*$

Initialize, for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$:
    $Q(s, a) \leftarrow$ arbitrary
    $Returns(s, a) \leftarrow$ empty list
    $\pi(a|s) \leftarrow$ an arbitrary $\varepsilon$-soft policy

Repeat forever:
    (a) Generate an episode using $\pi$
    (b) For each pair $s, a$ appearing in the episode:
            $G \leftarrow$ the return that follows the first occurrence of $s, a$
            Append $G$ to $Returns(s, a)$
            $Q(s, a) \leftarrow \mathrm{average}(Returns(s, a))$
    (c) For each $s$ in the episode:
            $A^* \leftarrow \arg\max_a Q(s, a)$ (with ties broken arbitrarily)
            For all $a \in \mathcal{A}(s)$:
                $\pi(a|s) \leftarrow \begin{cases} 1 - \varepsilon + \varepsilon/|\mathcal{A}(s)| & \text{if } a = A^* \\ \varepsilon/|\mathcal{A}(s)| & \text{if } a \ne A^* \end{cases}$
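
A rough Python sketch of the box above, under stated assumptions: the two-state task in `step` is a hypothetical toy problem, every episode starts in state 0 (no exploring starts), and a running mean with a visit count replaces the explicit `Returns` lists, which yields the same averages.

```python
import random
from collections import defaultdict

ACTIONS = (0, 1)
START_STATE = 0


def step(state, action):
    """Hypothetical toy dynamics: returns (reward, next_state); None is terminal."""
    if state == 0:
        return (0.0, 1) if action == 0 else (1.0, None)
    return (0.0, None) if action == 0 else (5.0, None)


def on_policy_mc_control(num_episodes, epsilon, rng, gamma=1.0):
    q = defaultdict(float)     # Q(s, a), initialised to zero
    counts = defaultdict(int)  # number of returns averaged into Q(s, a)
    n = len(ACTIONS)
    # The uniform random policy is epsilon-soft for any epsilon <= 1.
    policy = {s: [1.0 / n] * n for s in (0, 1)}
    for _ in range(num_episodes):
        # (a) Generate an episode using the current soft policy.
        episode, state = [], START_STATE
        while state is not None:
            action = rng.choices(ACTIONS, weights=policy[state])[0]
            reward, next_state = step(state, action)
            episode.append((state, action, reward))
            state = next_state
        # (b) Average the return following the first occurrence of each pair.
        g, first_return = 0.0, {}
        for s, a, r in reversed(episode):
            g = r + gamma * g
            first_return[(s, a)] = g
        for pair, g_first in first_return.items():
            counts[pair] += 1
            q[pair] += (g_first - q[pair]) / counts[pair]
        # (c) Make the policy epsilon-greedy at the visited states.
        for s in {s for s, _, _ in episode}:
            best = max(ACTIONS, key=lambda a: q[(s, a)])
            policy[s] = [epsilon / n + (1.0 - epsilon) * (a == best)
                         for a in ACTIONS]
    return policy, dict(q)
```

Because the policy never stops exploring, the action values it learns are those of the $\varepsilon$-soft policy itself; in this toy task the value learned for action 0 in state 0 approaches 4.5 rather than 5 when `epsilon = 0.2`.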


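That any $\varepsilon$-greedy policy with respect to $q_\pi$ is an improvement over any $\varepsilon$-soft policy $\pi$ is assured
by the policy improvement theorem. Let $\pi'$ be the $\varepsilon$-greedy policy. The conditions of the policy
improvement theorem apply because for any $s \in \mathcal{S}$:

$$
\begin{aligned}
q_\pi(s, \pi'(s)) &= \sum_a \pi'(a|s)\, q_\pi(s,a) \\
&= \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s,a) + (1-\varepsilon) \max_a q_\pi(s,a) \qquad \text{(5.2)} \\
&\ge \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s,a) + (1-\varepsilon) \sum_a \frac{\pi(a|s) - \frac{\varepsilon}{|\mathcal{A}(s)|}}{1-\varepsilon}\, q_\pi(s,a) \\
&= \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s,a) - \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s,a) + \sum_a \pi(a|s)\, q_\pi(s,a) \\
&= v_\pi(s).
\end{aligned}
$$

(In the inequality step, the sum is a weighted average with nonnegative weights summing to 1, and as such
it must be less than or equal to the largest number averaged.)

Thus, by the policy improvement theorem, $\pi' \ge \pi$ (i.e., $v_{\pi'}(s) \ge v_\pi(s)$, for all $s \in \mathcal{S}$). We now prove
that equality can hold only when both $\pi'$ and $\pi$ are optimal among the $\varepsilon$-soft policies, that is, when
they are better than or equal to all other $\varepsilon$-soft policies.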

Consider a new environment that is just like the original environment, except with the requirement
that policies be $\varepsilon$-soft “moved inside” the environment. The new environment has the same action and
state set as the original and behaves as follows. If in state $s$ and taking action $a$, then with probability
$1 - \varepsilon$ the new environment behaves exactly like the old environment. With probability $\varepsilon$ it repicks the
action at random, with equal probabilities, and then behaves like the old environment with the new,
random action. The best one can do in this new environment with general policies is the same as the
best one could do in the original environment with $\varepsilon$-soft policies. Let $\tilde{v}_*$ and $\tilde{q}_*$ denote the optimal
value functions for the new environment. Then a policy $\pi$ is optimal among $\varepsilon$-soft policies if and only
if $v_\pi = \tilde{v}_*$. From the definition of $\tilde{v}_*$ we know that it is the unique solution to

$$
\begin{aligned}
\tilde{v}_*(s) &= (1-\varepsilon) \max_a \tilde{q}_*(s,a) + \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a \tilde{q}_*(s,a) \\
&= (1-\varepsilon) \max_a \sum_{s',r} p(s',r|s,a) \Big[ r + \gamma \tilde{v}_*(s') \Big] + \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a \sum_{s',r} p(s',r|s,a) \Big[ r + \gamma \tilde{v}_*(s') \Big].
\end{aligned}
$$

When equality holds and the $\varepsilon$-soft policy $\pi$ is no longer improved, then we also know, from (5.2), that

$$
\begin{aligned}
v_\pi(s) &= (1-\varepsilon) \max_a q_\pi(s,a) + \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s,a) \\
&= (1-\varepsilon) \max_a \sum_{s',r} p(s',r|s,a) \Big[ r + \gamma v_\pi(s') \Big] + \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a \sum_{s',r} p(s',r|s,a) \Big[ r + \gamma v_\pi(s') \Big].
\end{aligned}
$$

However, this equation is the same as the previous one, except for the substitution of $v_\pi$ for $\tilde{v}_*$. Since
$\tilde{v}_*$ is the unique solution, it must be that $v_\pi = \tilde{v}_*$.

In essence, we have shown in the last few pages that policy iteration works for $\varepsilon$-soft policies. Using
the natural notion of greedy policy for $\varepsilon$-soft policies, one is assured of improvement on every step,
except when the best policy has been found among the $\varepsilon$-soft policies. This analysis is independent of
how the action-value functions are determined at each stage, but it does assume that they are computed
exactly. This brings us to roughly the same point as in the previous section. Now we only achieve
the best policy among the $\varepsilon$-soft policies, but on the other hand, we have eliminated the assumption of
exploring starts.
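
The improvement step for a single state can be spot-checked numerically. The sketch below draws an arbitrary $\varepsilon$-soft policy and arbitrary action values, then compares the expected action value under the $\varepsilon$-greedy policy with that under the original policy. All names are hypothetical, and a check of this kind illustrates the inequality without proving it.

```python
import random


def epsilon_greedy(q_values, epsilon):
    """Epsilon-greedy probabilities for one state."""
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])
    return [epsilon / n + (1.0 - epsilon) * (a == best) for a in range(n)]


def random_epsilon_soft(n, epsilon, rng):
    """A random policy in which every action has probability >= epsilon / n."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [epsilon / n + (1.0 - epsilon) * w / total for w in weights]


def improvement_gap(q_values, policy, epsilon):
    """sum_a pi'(a|s) q(s,a) - sum_a pi(a|s) q(s,a) for epsilon-greedy pi'."""
    greedy = epsilon_greedy(q_values, epsilon)
    lhs = sum(p * q for p, q in zip(greedy, q_values))
    rhs = sum(p * q for p, q in zip(policy, q_values))
    return lhs - rhs
```

For any $\varepsilon$-soft `policy` the gap should be nonnegative up to rounding, and it is zero when `policy` is itself the $\varepsilon$-greedy policy.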


5.5 Off-policy Prediction via Importance Sampling 

All learning control methods face a dilemma: They seek to learn action values conditional on subsequent
optimal behavior, but they need to behave non-optimally in order to explore all actions (to
find the optimal actions). How can they learn about the optimal policy while behaving according to 
an exploratory policy? The on-policy approach in the preceding section is actually a compromise—it 
learns action values not for the optimal policy, but for a near-optimal policy that still explores. A more 
straightforward approach is to use two policies, one that is learned about and that becomes the optimal 
policy, and one that is more exploratory and is used to generate behavior. The policy being learned 
about is called the target policy, and the policy used to generate behavior is called the behavior policy. 
In this case we say that learning is from data “off” the target policy, and the overall process is termed
off-policy learning. 

Throughout the rest of this book we consider both on-policy and off-policy methods. On-policy
methods are generally simpler and are considered first. Off-policy methods require additional concepts 
and notation, and because the data is due to a different policy, off-policy methods are often of greater 
variance and are slower to converge. On the other hand, off-policy methods are more powerful and 
general. They include on-policy methods as the special case in which the target and behavior policies 
are the same. Off-policy methods also have a variety of additional uses in applications. For example, 
they can often be applied to learn from data generated by a conventional non-learning controller, or 
from a human expert. Off-policy learning is also seen by some as key to learning multi-step predictive 
models of the world’s dynamics (Sutton, 2009; Sutton et al., 2011).

In this section we begin the study of off-policy methods by considering the prediction problem, in 
which both target and behavior policies are fixed. That is, suppose we wish to estimate $v_\pi$ or $q_\pi$, but
all we have are episodes following another policy $b$, where $b \ne \pi$. In this case, $\pi$ is the target policy, $b$
is the behavior policy, and both policies are considered fixed and given.

In order to use episodes from $b$ to estimate values for $\pi$, we require that every action taken under
$\pi$ is also taken, at least occasionally, under $b$. That is, we require that $\pi(a|s) > 0$ implies $b(a|s) > 0$.
This is called the assumption of coverage. It follows from coverage that $b$ must be stochastic in
states where it is not identical to $\pi$. The target policy $\pi$, on the other hand, may be deterministic,
and, in fact, this is a case of particular interest in control problems. In control, the target policy is
typically the deterministic greedy policy with respect to the current action-value function estimate.
This policy becomes a deterministic optimal policy while the behavior policy remains stochastic and
more exploratory, for example, an $\varepsilon$-greedy policy. In this section, however, we consider the prediction
problem, in which $\pi$ is unchanging and given.

Almost all off-policy methods utilize importance sampling, a general technique for estimating expected 
values under one distribution given samples from another. We apply importance sampling to off-policy 
learning by weighting returns according to the relative probability of their trajectories occurring under 
the target and behavior policies, called the importance-sampling ratio. Given a starting state $S_t$, the
probability of the subsequent state-action trajectory, $A_t, S_{t+1}, A_{t+1}, \ldots, S_T$, occurring under any policy
$\pi$ is

$$
\begin{aligned}
\Pr\{A_t, S_{t+1}, A_{t+1}, \ldots, S_T \mid S_t, A_{t:T-1} \sim \pi\}
&= \pi(A_t|S_t)\, p(S_{t+1}|S_t,A_t)\, \pi(A_{t+1}|S_{t+1}) \cdots p(S_T|S_{T-1},A_{T-1}) \\
&= \prod_{k=t}^{T-1} \pi(A_k|S_k)\, p(S_{k+1}|S_k,A_k),
\end{aligned}
$$

where $p$ here is the state-transition probability function defined by (3.4). Thus, the relative probability
of the trajectory under the target and behavior policies (the importance-sampling ratio) is

$$\rho_{t:T-1} \doteq \frac{\prod_{k=t}^{T-1} \pi(A_k|S_k)\, p(S_{k+1}|S_k,A_k)}{\prod_{k=t}^{T-1} b(A_k|S_k)\, p(S_{k+1}|S_k,A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}. \tag{5.3}$$

Although the trajectory probabilities depend on the MDP’s transition probabilities, which are generally 
unknown, they appear identically in both the numerator and denominator, and thus cancel. The 
importance sampling ratio ends up depending only on the two policies and the sequence, not on the 
MDP. 
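
Because the transition probabilities cancel, the ratio can be computed from the two policies and the observed state-action sequence alone. A minimal sketch, assuming policies are given as functions `(action, state) -> probability`; the two example policies and all names are hypothetical.

```python
def importance_sampling_ratio(trajectory, target, behavior):
    """Product over the trajectory of pi(A_k|S_k) / b(A_k|S_k).

    `trajectory` is a list of (state, action) pairs generated by the
    behavior policy. Coverage is assumed, so behavior(action, state) > 0
    for every pair that actually occurs.
    """
    rho = 1.0
    for state, action in trajectory:
        rho *= target(action, state) / behavior(action, state)
    return rho


def always_left(action, state):
    """Hypothetical deterministic target policy."""
    return 1.0 if action == "left" else 0.0


def uniform(action, state):
    """Hypothetical behavior policy over two actions."""
    return 0.5
```

A trajectory of two `left` actions has ratio 4 under these policies, while any trajectory containing `right` has ratio 0, since the target policy would never have produced it.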

Now we are ready to give a Monte Carlo algorithm that uses a batch of observed episodes following 
policy $b$ to estimate $v_\pi(s)$. It is convenient here to number time steps in a way that increases across
episode boundaries. That is, if the first episode of the batch ends in a terminal state at time 100, then 
the next episode begins at time t = 101. This enables us to use time-step numbers to refer to particular 
steps in particular episodes. In particular, we can define the set of all time steps in which state $s$
is visited, denoted $\mathcal{T}(s)$. This is for an every-visit method; for a first-visit method, $\mathcal{T}(s)$ would only
include time steps that were first visits to $s$ within their episodes. Also, let $T(t)$ denote the first time
of termination following time $t$, and $G_t$ denote the return after $t$ up through $T(t)$. Then $\{G_t\}_{t \in \mathcal{T}(s)}$ are
the returns that pertain to state $s$, and $\{\rho_{t:T(t)-1}\}_{t \in \mathcal{T}(s)}$ are the corresponding importance-sampling
ratios. To estimate $v_\pi(s)$, we simply scale the returns by the ratios and average the results:

$$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}\, G_t}{|\mathcal{T}(s)|}. \tag{5.4}$$

When importance sampling is done as a simple average in this way it is called ordinary importance 
sampling. 

An important alternative is weighted importance sampling, which uses a weighted average, defined as

$$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}\, G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}, \tag{5.5}$$


or zero if the denominator is zero. To understand these two varieties of importance sampling, consider
their estimates after observing a single return. In the weighted-average estimate, the ratio $\rho_{t:T(t)-1}$
for the single return cancels in the numerator and denominator, so that the estimate is equal to the
observed return independent of the ratio (assuming the ratio is nonzero). Given that this return was the
only one observed, this is a reasonable estimate, but its expectation is $v_b(s)$ rather than $v_\pi(s)$, and in
this statistical sense it is biased. In contrast, the simple average (5.4) is always $v_\pi(s)$ in expectation (it
is unbiased), but it can be extreme. Suppose the ratio were ten, indicating that the trajectory observed 
is ten times as likely under the target policy as under the behavior policy. In this case the ordinary 
importance-sampling estimate would be ten times the observed return. That is, it would be quite far 
from the observed return even though the episode’s trajectory is considered very representative of the 
target policy. 
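
The single-return comparison above is easy to reproduce. The sketch below implements the two estimators for one state, given parallel lists of ratios and returns; the function names are assumptions, and returning zero for an empty batch in the ordinary estimator is a convention chosen here, not something the text specifies.

```python
def ordinary_is(ratios, returns):
    """Ordinary importance sampling (5.4): the sum of the scaled returns
    divided by the number of returns."""
    if not returns:
        return 0.0  # assumed convention for an empty batch
    return sum(rho * g for rho, g in zip(ratios, returns)) / len(returns)


def weighted_is(ratios, returns):
    """Weighted importance sampling (5.5): the sum of the scaled returns
    divided by the sum of the ratios, or zero if that sum is zero."""
    denominator = sum(ratios)
    if denominator == 0:
        return 0.0
    return sum(rho * g for rho, g in zip(ratios, returns)) / denominator
```

With one observed return of 1 and a ratio of 10, `ordinary_is([10.0], [1.0])` gives 10 while `weighted_is([10.0], [1.0])` gives 1, matching the discussion above.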

Formally, the difference between the two kinds of importance sampling is expressed in their biases and 
variances. The ordinary importance-sampling estimator is unbiased whereas the weighted importance-sampling
estimator is biased (the bias converges asymptotically to zero). On the other hand, the
variance of the ordinary importance-sampling estimator is in general unbounded because the variance 
of the ratios can be unbounded, whereas in the weighted estimator the largest weight on any single 
return is one. In fact, assuming bounded returns, the variance of the weighted importance-sampling 
estimator converges to zero even if the variance of the ratios themselves is infinite (Precup, Sutton, 
and Dasgupta 2001). In practice, the weighted estimator usually has dramatically lower variance and 
is strongly preferred. Nevertheless, we will not totally abandon ordinary importance sampling as it 
is easier to extend to the approximate methods using function approximation that we explore in the 
second part of this book. 

A complete every-visit MC algorithm for off-policy policy evaluation using weighted importance 
sampling is given in the next section on page 90. 

Example 5.4: Off-policy Estimation of a Blackjack State Value 

We applied both ordinary and weighted importance-sampling methods to estimate the value of a single 
blackjack state from off-policy data. Recall that one of the advantages of Monte Carlo methods is that 
they can be used to evaluate a single state without forming estimates for any other states. In this 
example, we evaluated the state in which the dealer is showing a deuce, the sum of the player’s cards is 
13, and the player has a usable ace (that is, the player holds an ace and a deuce, or equivalently three 
aces). The data was generated by starting in this state then choosing to hit or stick at random with 
equal probability (the behavior policy). The target policy was to stick only on a sum of 20 or 21, as 
in Example 5.1. The value of this state under the target policy is approximately $-0.27726$ (this was
determined by separately generating one-hundred million episodes using the target policy and averaging
their returns). Both off-policy methods closely approximated this value after 1000 off-policy episodes 
using the random policy. To make sure they did this reliably, we performed 100 independent runs, each 
starting from estimates of zero and learning for 10,000 episodes. Figure 5.3 shows the resultant learning 
curves—the squared error of the estimates of each method as a function of number of episodes, averaged 
over the 100 runs. The error approaches zero for both algorithms, but the weighted importance-sampling 
method has much lower error at the beginning, as is typical in practice. 



[Figure 5.3 plot; horizontal axis: Episodes (log scale).]

Figure 5.3: Weighted importance sampling produces lower error estimates of the value of a single blackjack
state from off-policy episodes (see Example 5.4). ■

Example 5.5: Infinite Variance The estimates of ordinary importance sampling will typically have 
infinite variance, and thus unsatisfactory convergence properties, whenever the scaled returns have 
infinite variance—and this can easily happen in off-policy learning when trajectories contain loops. A 
simple example is shown inset in Figure 5.4. There is only one nonterminal state s and two actions, 
right and left. The right action causes a deterministic transition to termination, whereas the left action 
transitions, with probability 0.9, back to s or, with probability 0.1, on to termination. The rewards are 
+1 on the latter transition and otherwise zero. Consider the target policy that always selects left. All 
episodes under this policy consist of some number (possibly zero) of transitions back to s followed by 
termination with a reward and return of +1. Thus the value of s under the target policy is 1 ($\gamma = 1$).
Suppose we are estimating this value from off-policy data using the behavior policy that selects right 
and left with equal probability. 

The lower part of Figure 5.4 shows ten independent runs of the first-visit MC algorithm using ordinary 
importance sampling. Even after millions of episodes, the estimates fail to converge to the correct value 
of 1. In contrast, the weighted importance-sampling algorithm would give an estimate of exactly 1 
forever after the first episode that ended with the left action. All returns not equal to 1 (that is, ending
with the right action) would be inconsistent with the target policy and thus would have a $\rho_{t:T(t)-1}$
of zero and contribute neither to the numerator nor denominator of (5.5). The weighted importance-sampling
algorithm produces a weighted average of only the returns consistent with the target policy,
and all of these would be exactly 1.
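
The one-state example is small enough to simulate directly. The following sketch generates episodes under the equiprobable behavior policy and forms both first-visit estimates of the value of s for the always-left target policy; the function names are assumptions made for the illustration.

```python
import random


def run_episode(rng):
    """One episode under the behavior policy (left and right equally likely).

    Returns (ratio, return), where the ratio is for the target policy
    that always selects left.
    """
    rho = 1.0
    while True:
        if rng.random() < 0.5:   # behavior policy selects right
            return 0.0, 0.0      # target never does: ratio 0, reward 0
        rho *= 1.0 / 0.5         # left: pi(left|s) = 1, b(left|s) = 0.5
        if rng.random() < 0.1:   # left terminates with reward +1
            return rho, 1.0
        # otherwise left leads back to s with reward 0


def estimates(num_episodes, rng):
    """Ordinary and weighted importance-sampling estimates of the value of s."""
    scaled_sum, ratio_sum = 0.0, 0.0
    for _ in range(num_episodes):
        rho, g = run_episode(rng)
        scaled_sum += rho * g
        ratio_sum += rho
    ordinary = scaled_sum / num_episodes
    weighted = scaled_sum / ratio_sum if ratio_sum > 0 else 0.0
    return ordinary, weighted
```

Once at least one episode has ended with the left action, the weighted estimate is exactly 1, because every return with a nonzero ratio equals 1. The ordinary estimate, by contrast, can move sharply whenever a long episode with a large ratio occurs.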

We can verify that the variance of the importance-sampling-scaled returns is infinite in this example
by a simple calculation. The variance of any random variable $X$ is the expected value of the squared
deviation from its mean $\bar{X}$, which can be written

$$\mathrm{Var}[X] \doteq \mathbb{E}\left[(X - \bar{X})^2\right] = \mathbb{E}\left[X^2 - 2X\bar{X} + \bar{X}^2\right] = \mathbb{E}\left[X^2\right] - \bar{X}^2.$$

Figure 5.4: Ordinary importance sampling produces surprisingly unstable estimates on the one-state MDP
shown inset (Example 5.5). The correct estimate here is 1 ($\gamma = 1$), and, even though this is the expected value
of a sample return (after importance sampling), the variance of the samples is infinite, and the estimates do not
converge to this value. These results are for off-policy first-visit MC.


Thus, if the mean is finite, as it is in our case, the variance is infinite if and only if the expectation of 
the square of the random variable is infinite. Thus, we need only show that the expected square of the 
importance-sampling-scaled return is infinite: 


$$\mathbb{E}_b\left[\left(\prod_{t=0}^{T-1} \frac{\pi(A_t|S_t)}{b(A_t|S_t)}\, G_0\right)^2\right].$$


To compute this expectation, we break it down into cases based on episode length and termination. 
First note that, for any episode ending with the right action, the importance sampling ratio is zero, 
because the target policy would never take this action; these episodes thus contribute nothing to the 
expectation (the quantity in parentheses will be zero) and can be ignored. We need only consider episodes 
that involve some number (possibly zero) of left actions that transition back to the nonterminal state, 
followed by a left action transitioning to termination. All of these episodes have a return of 1, so the 
$G_0$ factor can be ignored. To get the expected square we need only consider each length of episode, 
multiplying the probability of the episode’s occurrence by the square of its importance-sampling ratio, 
and add these up: 
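
Assuming the dynamics of Example 5.5 (the left action leads back to the nonterminal state with probability 0.9 and terminates with reward +1 with probability 0.1, while $b$ selects left with probability 1/2; these numbers are taken as given rather than restated in this passage), the sum can be worked out as follows. An episode with $k$ returns to the state before terminating has probability $(\tfrac12 \cdot 0.9)^k \cdot \tfrac12 \cdot 0.1$ and importance-sampling ratio $2^{k+1}$:

```latex
\begin{aligned}
\mathbb{E}_b\!\left[\left(\prod_{t=0}^{T-1}\frac{\pi(A_t|S_t)}{b(A_t|S_t)}\,G_0\right)^{2}\right]
&= \tfrac12\cdot 0.1\left(\tfrac{1}{0.5}\right)^{2}
   && \text{(episode of length 1)}\\
&\quad+ \tfrac12\cdot 0.9\cdot\tfrac12\cdot 0.1
   \left(\tfrac{1}{0.5}\,\tfrac{1}{0.5}\right)^{2}
   && \text{(episode of length 2)}\\
&\quad+ \cdots\\
&= 0.1\sum_{k=0}^{\infty} 0.9^{k}\cdot 2^{k}\cdot 2
 \;=\; 0.2\sum_{k=0}^{\infty} 1.8^{k} \;=\; \infty .
\end{aligned}
```

Each term is $0.05 \cdot 0.45^{k} \cdot 4^{k+1} = 0.2 \cdot 1.8^{k}$, so the terms grow geometrically and the series diverges.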

Exercise 5.3 What is the equation analogous to (5.5) for action values $Q(s,a)$ instead of state values 
$V(s)$, again given returns generated using $b$? □ 

Exercise 5.4 In learning curves such as those shown in Figure 5.3 error generally decreases with 
training, as indeed happened for the ordinary importance-sampling method. But for the weighted 
importance-sampling method error first increased and then decreased. Why do you think this happened? 
□ 

Exercise 5.5 The results with Example 5.5 and shown in Figure 5.4 used a first-visit MC method. 
Suppose that instead an every-visit MC method was used on the same problem. Would the variance of 
the estimator still be infinite? Why or why not? □ 


5.6 Incremental Implementation 


Monte Carlo prediction methods can be implemented incrementally, on an episode-by-episode basis, 
using extensions of the techniques described in Chapter 2 (Section 2.4). Whereas in Chapter 2 we 
averaged rewards, in Monte Carlo methods we average returns. In all other respects exactly the same 
methods as used in Chapter 2 can be used for on-policy Monte Carlo methods. For off-policy Monte 
Carlo methods, we need to separately consider those that use ordinary importance sampling and those 
that use weighted importance sampling. 

In ordinary importance sampling, the returns are scaled by the importance sampling ratio $\rho_{t:T(t)-1}$ 
(5.3), then simply averaged. For these methods we can again use the incremental methods of Chapter 2, 
but using the scaled returns in place of the rewards of that chapter. This leaves the case of off-policy 
methods using weighted importance sampling. Here we have to form a weighted average of the returns, 
and a slightly different incremental algorithm is required. 

Suppose we have a sequence of returns $G_1, G_2, \ldots, G_{n-1}$, all starting in the same state and each with 
a corresponding random weight $W_i$ (e.g., $W_i = \rho_{t_i:T(t_i)-1}$). We wish to form the estimate 

$$V_n = \frac{\sum_{k=1}^{n-1} W_k G_k}{\sum_{k=1}^{n-1} W_k}, \qquad n \ge 2, \tag{5.6}$$


and keep it up-to-date as we obtain a single additional return $G_n$. In addition to keeping track of $V_n$, 
we must maintain for each state the cumulative sum $C_n$ of the weights given to the first $n$ returns. The 
update rule for $V_n$ is 

$$V_{n+1} = V_n + \frac{W_n}{C_n}\big[G_n - V_n\big], \qquad n \ge 1, \tag{5.7}$$

and 

$$C_{n+1} = C_n + W_{n+1},$$

where $C_0 = 0$ (and $V_1$ is arbitrary and thus need not be specified). The box below contains 
a complete episode-by-episode incremental algorithm for Monte Carlo policy evaluation. The algorithm 
is nominally for the off-policy case, using weighted importance sampling, but applies as well to the 
on-policy case just by choosing the target and behavior policies as the same (in which case ($\pi = b$), 
$W$ is always 1). The approximation $Q$ converges to $q_\pi$ (for all encountered state-action pairs) while 
actions are selected according to a potentially different policy, $b$. 
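
The update rule (5.7) can be checked numerically against the batch definition (5.6). The following is a minimal sketch; the function names are ours, and skipping zero-weight returns is our own choice to avoid dividing by a zero cumulative weight.

```python
def weighted_average_incremental(returns, weights):
    """Apply update rule (5.7) to each (G_n, W_n) pair in turn."""
    v = 0.0  # V_1 is arbitrary: the first nonzero-weight update overwrites it
    c = 0.0  # C_0 = 0
    for g, w in zip(returns, weights):
        if w == 0.0:
            continue  # a zero weight changes neither C nor V
        c += w                   # C_n = C_{n-1} + W_n
        v += (w / c) * (g - v)   # V_{n+1} = V_n + (W_n / C_n) [G_n - V_n]
    return v

def weighted_average_batch(returns, weights):
    """Direct evaluation of the weighted average (5.6)."""
    total_weight = sum(weights)
    return sum(w * g for g, w in zip(returns, weights)) / total_weight

returns = [1.0, 0.0, 4.0]
weights = [2.0, 1.0, 0.5]
print(weighted_average_incremental(returns, weights))
print(weighted_average_batch(returns, weights))
```

Both calls compute the same quantity, $(2 \cdot 1 + 1 \cdot 0 + 0.5 \cdot 4)/3.5$, up to floating-point rounding; the incremental form needs only the running pair $(V_n, C_n)$ rather than the full history of returns.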

Exercise 5.6 Modify the algorithm for first-visit MC policy evaluation (Section 5.1) to use the 
incremental implementation for sample averages described in Section 2.4. □ 

Exercise 5.7 Derive the weighted-average update rule (5.7) from (5.6). Follow the pattern of the 
derivation of the unweighted rule (2.3). □ 

Off-policy MC prediction, for estimating Q ≈ q_π 

Input: an arbitrary target policy π 

Initialize, for all s ∈ S, a ∈ A(s): 
    Q(s, a) ← arbitrary 
    C(s, a) ← 0 

Repeat forever: 
    b ← any policy with coverage of π 
    Generate an episode using b: 
        S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T, S_T 
    G ← 0 
    W ← 1 
    For t = T−1, T−2, ... down to 0: 
        G ← γG + R_{t+1} 
        C(S_t, A_t) ← C(S_t, A_t) + W 
        Q(S_t, A_t) ← Q(S_t, A_t) + (W / C(S_t, A_t)) [G − Q(S_t, A_t)] 
        W ← W π(A_t|S_t) / b(A_t|S_t) 
        If W = 0 then exit For loop 
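
The inner loop of the box might be sketched in Python as follows. The episode format (a list of state, action, reward triples) and the two probability callables are illustrative assumptions of ours, not part of the algorithm's specification.

```python
from collections import defaultdict

def off_policy_mc_update(episode, Q, C, target_prob, behavior_prob, gamma=1.0):
    """Process one episode backward, updating Q and C in place.

    episode: list of (S_t, A_t, R_{t+1}) triples in time order.
    target_prob(s, a) and behavior_prob(s, a): hypothetical callables
    returning pi(a|s) and b(a|s); b must cover pi.
    """
    G = 0.0
    W = 1.0
    for state, action, reward in reversed(episode):
        G = gamma * G + reward
        C[(state, action)] += W
        Q[(state, action)] += (W / C[(state, action)]) * (G - Q[(state, action)])
        W *= target_prob(state, action) / behavior_prob(state, action)
        if W == 0.0:
            break  # earlier steps would receive zero weight

# Usage on a one-state problem: the target policy always chooses "left",
# the behavior policy chooses each action with probability 1/2.
Q = defaultdict(float)
C = defaultdict(float)
pi = lambda s, a: 1.0 if a == "left" else 0.0
b = lambda s, a: 0.5
off_policy_mc_update([("s", "left", 0.0), ("s", "left", 1.0)], Q, C, pi, b)
off_policy_mc_update([("s", "left", 0.0), ("s", "right", 0.0)], Q, C, pi, b)
print(Q[("s", "left")], C[("s", "left")])
```

In the second episode the final action has zero probability under the target policy, so $W$ becomes zero after the first backward step and the earlier left action is not updated.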


5.7 Off-policy Monte Carlo Control 

We are now ready to present an example of the second class of learning control methods we consider in 
this book: off-policy methods. Recall that the distinguishing feature of on-policy methods is that they 
estimate the value of a policy while using it for control. In off-policy methods these two functions are 
separated. The policy used to generate behavior, called the behavior policy, may in fact be unrelated 
to the policy that is evaluated and improved, called the target policy. An advantage of this separation 
is that the target policy may be deterministic (e.g., greedy), while the behavior policy can continue to 
sample all possible actions. 

Off-policy Monte Carlo control methods use one of the techniques presented in the preceding two 
sections. They follow the behavior policy while learning about and improving the target policy. These 
techniques require that the behavior policy has a nonzero probability of selecting all actions that might 
be selected by the target policy (coverage). To explore all possibilities, we require that the behavior 
policy be soft (i.e., that it select all actions in all states with nonzero probability). 

The box on the next page shows an off-policy Monte Carlo control method, based on GPI and 
weighted importance sampling, for estimating $\pi_*$ and $q_*$. The target policy $\pi \approx \pi_*$ is the greedy policy 
with respect to $Q$, which is an estimate of $q_\pi$. The behavior policy $b$ can be anything, but in order to 
assure convergence of $\pi$ to the optimal policy, an infinite number of returns must be obtained for each 
pair of state and action. This can be assured by choosing $b$ to be $\varepsilon$-soft. The policy $\pi$ converges to 
optimal at all encountered states even though actions are selected according to a different soft policy $b$, 
which may change between or even within episodes. 

A potential problem is that this method learns only from the tails of episodes, when all of the 
remaining actions in the episode are greedy. If nongreedy actions are common, then learning will be slow, 
particularly for states appearing in the early portions of long episodes. There has been insufficient 
experience with off-policy Monte Carlo methods to assess how serious this problem is. If it is serious, 
the most important way to address it is probably by incorporating temporal-difference learning, the 
algorithmic idea developed in the next chapter. Alternatively, if $\gamma$ is less than 1, then the idea 
developed in the next section may also help significantly. 
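
One way the control method described above might be realized is sketched below. Several details are our own assumptions: the environment interface (`step` returning a reward and a next state, with `None` marking termination), the use of an $\varepsilon$-greedy behavior policy, tie-breaking by action order, and zero initial values. The backward loop stops at the first nongreedy action, which is the tail-only learning just discussed.

```python
import random
from collections import defaultdict

def off_policy_mc_control(start_state, step, actions, num_episodes,
                          gamma=1.0, epsilon=0.1, seed=0):
    """Off-policy MC control with weighted importance sampling (a sketch).

    step(state, action, rng) -> (reward, next_state) is a hypothetical
    environment interface; next_state is None when the episode terminates.
    """
    rng = random.Random(seed)
    Q = defaultdict(float)  # action-value estimates
    C = defaultdict(float)  # cumulative importance-sampling weights

    def greedy(state):
        # Target policy: greedy with respect to Q (ties go to the first action).
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        # Generate an episode with an epsilon-greedy (hence soft) behavior
        # policy b, recording b(A_t|S_t) alongside each step.
        episode, state = [], start_state
        while state is not None:
            g = greedy(state)
            action = rng.choice(actions) if rng.random() < epsilon else g
            prob = epsilon / len(actions) + ((1.0 - epsilon) if action == g else 0.0)
            reward, next_state = step(state, action, rng)
            episode.append((state, action, reward, prob))
            state = next_state

        G, W = 0.0, 1.0
        for state, action, reward, prob in reversed(episode):
            G = gamma * G + reward
            C[(state, action)] += W
            Q[(state, action)] += (W / C[(state, action)]) * (G - Q[(state, action)])
            if action != greedy(state):
                break      # pi(A_t|S_t) = 0, so earlier steps get zero weight
            W /= prob      # pi(A_t|S_t) = 1 for the greedy action
    return Q, greedy

# Usage: a one-step problem in which action "a" pays 1 and action "b" pays 0.
def one_step(state, action, rng):
    return (1.0 if action == "a" else 0.0), None

Q, policy = off_policy_mc_control(0, one_step, ["b", "a"], 200, epsilon=0.5, seed=1)
print(policy(0), Q[(0, "a")], Q[(0, "b")])
```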

Exercise 5.8: Racetrack (programming) Consider driving a race car around a turn like those 
shown in Figure 5.5. You want to go as fast as possible, but not so fast as to run off the track. In our 
simplified racetrack, the car is at one of a discrete set of grid positions, the cells in the diagram. The 
velocity is also discrete, a number of grid cells moved horizontally and vertically per time step. The 
actions are increments to the velocity components. Each may be changed by +1, −1, or 0 in one step, 
for a total of nine actions. Both velocity components are restricted to be nonnegative and less than 5, 
and they cannot both be zero except at the starting line. Each episode begins in one of the randomly 
selected start states with both velocity components zero and ends when the car crosses the finish line. 
The rewards are −1 for each step until the car crosses the finish line. If the car hits the track boundary, 
it is moved back to a random position on the starting line, both velocity components are reduced to 
zero, and the episode continues. Before updating the car’s location at each time step, check to see if 
the projected path of the car intersects the track boundary. If it intersects the finish line, the episode 
ends; if it intersects anywhere else, the car is considered to have hit the track boundary and is sent 
back to the starting line. To make the task more challenging, with probability 0.1 at each time step 
the velocity increments are both zero, independently of the intended increments. Apply a Monte Carlo 
control method to this task to compute the optimal policy from each starting state. Exhibit several 
trajectories following the optimal policy (but turn the noise off for these trajectories). □ 

Figure 5.5: A couple of right turns for the racetrack task. (Each track is marked with a starting line 
and a finish line.) 


5.8 *Discounting-aware Importance Sampling 

The off-policy methods that we have considered so far are based on forming importance-sampling weights 
for returns considered as unitary wholes, without taking into account the returns’ internal structures as 
sums of discounted rewards. We now briefly consider cutting-edge research ideas for using this structure 
to significantly reduce the variance of off-policy estimators. 

For example, consider the case where episodes are long and $\gamma$ is significantly less than 1. For 
concreteness, say that episodes last 100 steps and that $\gamma = 0$. The return from time 0 will then be just 
$G_0 = R_1$, but its importance sampling ratio will be a product of 100 factors, 
$\frac{\pi(A_0|S_0)}{b(A_0|S_0)}\frac{\pi(A_1|S_1)}{b(A_1|S_1)}\cdots\frac{\pi(A_{99}|S_{99})}{b(A_{99}|S_{99})}$. 
In ordinary importance sampling, the return will be scaled by the entire product, but it is really only 
necessary to scale by the first factor, by $\frac{\pi(A_0|S_0)}{b(A_0|S_0)}$. The other 99 factors, 
$\frac{\pi(A_1|S_1)}{b(A_1|S_1)}\cdots\frac{\pi(A_{99}|S_{99})}{b(A_{99}|S_{99})}$, are irrelevant 
because after the first reward the return has already been determined. These later factors are all 
independent of the return and of expected value 1; they do not change the expected update, but they 
add enormously to its variance. In some cases they could even make the variance infinite. Let us now 
consider an idea for avoiding this large extraneous variance. 

The essence of the idea is to think of discounting as determining a probability of termination or, 
equivalently, a degree of partial termination. For any $\gamma \in [0,1)$, we can think of the return $G_0$ as partly 
terminating in one step, to the degree $1-\gamma$, producing a return of just the first reward, $R_1$, and as 
partly terminating after two steps, to the degree $(1-\gamma)\gamma$, producing a return of $R_1 + R_2$, and so on. The 
latter degree corresponds to terminating on the second step, $1-\gamma$, and not having already terminated 
on the first step, $\gamma$. The degree of termination on the third step is thus $(1-\gamma)\gamma^2$, with the $\gamma^2$ reflecting 
that termination did not occur on either of the first two steps. The partial returns here are called flat 
partial returns: 

$$G_{t:h} = R_{t+1} + R_{t+2} + \cdots + R_h, \qquad 0 \le t < h \le T,$$

where “flat” denotes the absence of discounting, and “partial” denotes that these returns do not extend 
all the way to termination but instead stop at h, called the horizon (and T is the time of termination of 
the episode). The conventional full return $G_t$ can be viewed as a sum of flat partial returns as suggested 
above as follows: 

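$$\begin{aligned}
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T \\
&= (1-\gamma) R_{t+1} \\
&\quad + (1-\gamma)\gamma\,(R_{t+1} + R_{t+2}) \\
&\quad + (1-\gamma)\gamma^2\,(R_{t+1} + R_{t+2} + R_{t+3}) \\
&\quad\;\;\vdots \\
&\quad + (1-\gamma)\gamma^{T-t-2}\,(R_{t+1} + R_{t+2} + \cdots + R_{T-1}) \\
&\quad + \gamma^{T-t-1}\,(R_{t+1} + R_{t+2} + \cdots + R_T) \\
&= (1-\gamma)\sum_{h=t+1}^{T-1} \gamma^{h-t-1} G_{t:h} \;+\; \gamma^{T-t-1} G_{t:T}.
\end{aligned}$$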
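
The decomposition of the full return into flat partial returns is easy to confirm numerically. The sketch below uses function names of our own choosing and takes the rewards $R_{t+1}, \ldots, R_T$ as a plain list.

```python
def discounted_return(rewards, gamma):
    """G_t for rewards = [R_{t+1}, ..., R_T]."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def return_from_flat_partials(rewards, gamma):
    """Rebuild G_t as a weighted sum of flat partial returns G_{t:h}."""
    n = len(rewards)  # n = T - t
    flat = 0.0        # running flat partial return G_{t:h}
    total = 0.0
    for i, r in enumerate(rewards, start=1):  # i = h - t
        flat += r
        if i < n:
            total += (1 - gamma) * gamma ** (i - 1) * flat  # partial termination at h
        else:
            total += gamma ** (n - 1) * flat                # actual termination at T
    return total

rewards = [1.0, 2.0, 3.0]
for gamma in (0.0, 0.5, 0.9, 1.0):
    print(gamma, discounted_return(rewards, gamma),
          return_from_flat_partials(rewards, gamma))
```

For $\gamma = 0.5$ both functions give $1 + 1 + 0.75 = 2.75$; the flat-partial form reaches it as $0.5 \cdot 1 + 0.25 \cdot 3 + 0.25 \cdot 6$.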

Now we need to scale the flat partial returns by an importance sampling ratio that is similarly 
truncated. As $G_{t:h}$ only involves rewards up to a horizon $h$, we only need the ratio of the probabilities 
up to $h$. We define an ordinary importance-sampling estimator, analogous to (5.4), as 

$$V(s) = \frac{\sum_{t\in\mathcal{T}(s)}\Big((1-\gamma)\sum_{h=t+1}^{T(t)-1}\gamma^{h-t-1}\rho_{t:h-1}G_{t:h} + \gamma^{T(t)-t-1}\rho_{t:T(t)-1}G_{t:T(t)}\Big)}{|\mathcal{T}(s)|}, \tag{5.8}$$

and a weighted importance-sampling estimator, analogous to (5.5), as 

$$V(s) = \frac{\sum_{t\in\mathcal{T}(s)}\Big((1-\gamma)\sum_{h=t+1}^{T(t)-1}\gamma^{h-t-1}\rho_{t:h-1}G_{t:h} + \gamma^{T(t)-t-1}\rho_{t:T(t)-1}G_{t:T(t)}\Big)}{\sum_{t\in\mathcal{T}(s)}\Big((1-\gamma)\sum_{h=t+1}^{T(t)-1}\gamma^{h-t-1}\rho_{t:h-1} + \gamma^{T(t)-t-1}\rho_{t:T(t)-1}\Big)}. \tag{5.9}$$

We call these two estimators discounting-aware importance sampling estimators. They take into account 
the discount rate but have no effect (are the same as the off-policy estimators from Section 5.5) if $\gamma = 1$. 
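
To make the numerator of (5.8) concrete, the sketch below computes the term contributed by a single visit, given the rewards that follow it and the per-step ratios $\pi(A_k|S_k)/b(A_k|S_k)$. The function name and argument layout are illustrative choices of ours.

```python
def discounting_aware_scaled_return(rewards, ratios, gamma):
    """One numerator term of (5.8) for a single time step t.

    rewards = [R_{t+1}, ..., R_T]; ratios[i] = pi(A_{t+i}|S_{t+i}) / b(A_{t+i}|S_{t+i}).
    """
    n = len(rewards)  # n = T - t
    rho = 1.0         # truncated ratio rho_{t:h-1}
    flat = 0.0        # flat partial return G_{t:h}
    total = 0.0
    for i, (r, ratio) in enumerate(zip(rewards, ratios), start=1):  # i = h - t
        rho *= ratio
        flat += r
        coeff = (1 - gamma) * gamma ** (i - 1) if i < n else gamma ** (n - 1)
        total += coeff * rho * flat
    return total

rewards = [5.0, 1.0, 1.0]
ratios = [2.0, 3.0, 4.0]
print(discounting_aware_scaled_return(rewards, ratios, 0.0))  # 2 * 5 = 10.0
print(discounting_aware_scaled_return(rewards, ratios, 1.0))  # 24 * 7 = 168.0
```

With $\gamma = 0$ only the first ratio scales the first reward, which is the situation described at the start of this section; with $\gamma = 1$ the term reduces to the full product of ratios times the undiscounted return, as in ordinary importance sampling.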


5.9 *Per-reward Importance Sampling 

There is one more way in which the structure of the return as a sum of rewards can be taken into account 
in off-policy importance sampling, a way that may be able to reduce variance even in the absence of 
discounting (that is, even if $\gamma = 1$). In the off-policy estimators (5.4) and (5.5), each term of the sum 
in the numerator is itself a sum: 

$$\begin{aligned}
\rho_{t:T-1} G_t &= \rho_{t:T-1}\left(R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T\right) \\
&= \rho_{t:T-1} R_{t+1} + \gamma \rho_{t:T-1} R_{t+2} + \cdots + \gamma^{T-t-1} \rho_{t:T-1} R_T.
\end{aligned} \tag{5.10}$$


The off-policy estimators rely on the expected values of these terms; let us see if we can write them 
in a simpler way. Note that each sub-term of (5.10) is a product of a random reward and a random 
importance-sampling ratio. For example, the first sub-term can be written, using (5.3), as 

$$\rho_{t:T-1} R_{t+1} = \frac{\pi(A_t|S_t)}{b(A_t|S_t)}\,\frac{\pi(A_{t+1}|S_{t+1})}{b(A_{t+1}|S_{t+1})}\,\frac{\pi(A_{t+2}|S_{t+2})}{b(A_{t+2}|S_{t+2})}\cdots\frac{\pi(A_{T-1}|S_{T-1})}{b(A_{T-1}|S_{T-1})}\,R_{t+1}.$$

Now notice that, of all these factors, only the first and the last (the reward) are correlated; all the other 
ratios are independent random variables whose expected value is one: 


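$$\mathbb{E}\left[\frac{\pi(A_k|S_k)}{b(A_k|S_k)}\right] = \sum_a b(a|S_k)\,\frac{\pi(a|S_k)}{b(a|S_k)} = \sum_a \pi(a|S_k) = 1. \tag{5.11}$$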


Thus, because the expectation of the product of independent random variables is the product of their 
expectations, all the ratios except the first drop out in expectation, leaving just 


$$\mathbb{E}[\rho_{t:T-1} R_{t+1}] = \mathbb{E}[\rho_{t:t} R_{t+1}].$$

If we repeat this analysis for the $k$th term of (5.10), we get 

$$\mathbb{E}[\rho_{t:T-1} R_{t+k}] = \mathbb{E}[\rho_{t:t+k-1} R_{t+k}].$$

It follows then that the expectation of our original term (5.10) can be written 

$$\mathbb{E}[\rho_{t:T-1} G_t] = \mathbb{E}\big[\tilde{G}_t\big],$$

where 

$$\tilde{G}_t = \rho_{t:t} R_{t+1} + \gamma \rho_{t:t+1} R_{t+2} + \gamma^2 \rho_{t:t+2} R_{t+3} + \cdots + \gamma^{T-t-1} \rho_{t:T-1} R_T.$$


We call this idea per-reward importance sampling. It follows immediately that there is an alternate 
importance-sampling estimator, with the same unbiased expectation as the ordinary-importance-sampling 
estimator (5.4), using $\tilde{G}_t$: 

$$V(s) = \frac{\sum_{t\in\mathcal{T}(s)} \tilde{G}_t}{|\mathcal{T}(s)|}, \tag{5.12}$$


which we might expect to sometimes be of lower variance. 
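
The claim that the per-reward return $\tilde{G}_t$ has the same expectation as $\rho_{t:T-1}G_t$ can be checked by exhaustive enumeration on a tiny problem. The two-step problem below (its policies and rewards) is invented purely for illustration.

```python
from itertools import product

def ordinary_scaled_return(rewards, ratios, gamma):
    """rho_{t:T-1} * G_t: the full ratio times the full return."""
    rho = 1.0
    for ratio in ratios:
        rho *= ratio
    return rho * sum(gamma ** k * r for k, r in enumerate(rewards))

def per_reward_return(rewards, ratios, gamma):
    """Each reward R_{t+k} is scaled only by rho_{t:t+k-1}."""
    rho, discount, total = 1.0, 1.0, 0.0
    for r, ratio in zip(rewards, ratios):
        rho *= ratio
        total += discount * rho * r
        discount *= gamma
    return total

def expected_under_b(estimator, gamma=1.0):
    """Exact expectation over all two-step action sequences generated by b."""
    pi = {0: 0.8, 1: 0.2}  # hypothetical target policy (same in every state)
    b = {0: 0.5, 1: 0.5}   # hypothetical behavior policy
    total = 0.0
    for a0, a1 in product((0, 1), repeat=2):
        rewards = [a0 + 1.0, 10.0 * a1]  # made-up reward for each action
        ratios = [pi[a0] / b[a0], pi[a1] / b[a1]]
        total += b[a0] * b[a1] * estimator(rewards, ratios, gamma)
    return total

print(expected_under_b(ordinary_scaled_return))
print(expected_under_b(per_reward_return))
```

Both expectations equal the value under the target policy, $0.8 \cdot 1 + 0.2 \cdot 2 + 0.2 \cdot 10 = 3.2$, even though the two estimators differ on individual episodes.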

Is there a per-reward version of weighted importance sampling? This is less clear. So far, all the 
estimators that have been proposed for this that we know of are not consistent (that is, they do not 
converge to the true value with infinite data). 

*Exercise 5.9 Modify the algorithm for off-policy Monte Carlo control (page 91) to use the idea of the 
truncated weighted-average estimator (5.9). Note that you will first need to convert this equation to 
action values. □ 


5.10 Summary 

The Monte Carlo methods presented in this chapter learn value functions and optimal policies from 
experience in the form of sample episodes. This gives them at least three kinds of advantages over 
DP methods. First, they can be used to learn optimal behavior directly from interaction with the 
environment, with no model of the environment’s dynamics. Second, they can be used with simulation 
or sample models. For surprisingly many applications it is easy to simulate sample episodes even though 
it is difficult to construct the kind of explicit model of transition probabilities required by DP methods. 
Third, it is easy and efficient to focus Monte Carlo methods on a small subset of the states. A region of 
special interest can be accurately evaluated without going to the expense of accurately evaluating the 
rest of the state set (we explore this further in Chapter 8). 

A fourth advantage of Monte Carlo methods, which we discuss later in the book, is that they may 
be less harmed by violations of the Markov property. This is because they do not update their value 
estimates on the basis of the value estimates of successor states. In other words, it is because they do 
not bootstrap. 

In designing Monte Carlo control methods we have followed the overall schema of generalized policy 
iteration (GPI) introduced in Chapter 4. GPI involves interacting processes of policy evaluation and 
policy improvement. Monte Carlo methods provide an alternative policy evaluation process. Rather 
than use a model to compute the value of each state, they simply average many returns that start in the 
state. Because a state’s value is the expected return, this average can become a good approximation to 
the value. In control methods we are particularly interested in approximating action-value functions, 
because these can be used to improve the policy without requiring a model of the environment’s 
transition dynamics. Monte Carlo methods intermix policy evaluation and policy improvement steps on an 
episode-by-episode basis, and can be incrementally implemented on an episode-by-episode basis. 

Maintaining sufficient exploration is an issue in Monte Carlo control methods. It is not enough 
just to select the actions currently estimated to be best, because then no returns will be obtained for 
alternative actions, and it may never be learned that they are actually better. One approach is to ignore 
this problem by assuming that episodes begin with state-action pairs randomly selected to cover all 
possibilities. Such exploring starts can sometimes be arranged in applications with simulated episodes, 
but are unlikely in learning from real experience. In on-policy methods, the agent commits to always 
exploring and tries to find the best policy that still explores. In off-policy methods, the agent also 
explores, but learns a deterministic optimal policy that may be unrelated to the policy followed. 

Off-policy prediction refers to learning the value function of a target policy from data generated by a 
different behavior policy. Such learning methods are based on some form of importance sampling, that 
is, on weighting returns by the ratio of the probabilities of taking the observed actions under the two 
policies. Ordinary importance sampling uses a simple average of the weighted returns, whereas weighted 
importance sampling uses a weighted average. Ordinary importance sampling produces unbiased 
estimates, but has larger, possibly infinite, variance, whereas weighted importance sampling always has 
finite variance and is preferred in practice. Despite their conceptual simplicity, off-policy Monte Carlo 
methods for both prediction and control remain unsettled and are a subject of ongoing research. 

The Monte Carlo methods treated in this chapter differ from the DP methods treated in the previous 
chapter in two major ways. First, they operate on sample experience, and thus can be used for direct 
learning without a model. Second, they do not bootstrap. That is, they do not update their value 
estimates on the basis of other value estimates. These two differences are not tightly linked, and can 
be separated. In the next chapter we consider methods that learn from experience, like Monte Carlo 
methods, but also bootstrap, like DP methods. 


Bibliographical and Historical Remarks 

The term “Monte Carlo” dates from the 1940s, when physicists at Los Alamos devised games of chance 
that they could study to help understand complex physical phenomena relating to the atom bomb. 
Coverage of Monte Carlo methods in this sense can be found in several textbooks (e.g., Kalos and 
Whitlock, 1986; Rubinstein, 1981). 

5.1-2 Singh and Sutton (1996) distinguished between every-visit and first-visit MC methods and 
proved results relating these methods to reinforcement learning algorithms. The blackjack 
example is based on an example used by Widrow, Gupta, and Maitra (1973). The soap bubble 
example is a classical Dirichlet problem whose Monte Carlo solution was first proposed by 
Kakutani (1945; see Hersh and Griego, 1969; Doyle and Snell, 1984). 

Barto and Duff (1994) discussed policy evaluation in the context of classical Monte Carlo 
algorithms for solving systems of linear equations. They used the analysis of Curtiss (1954) to 
point out the computational advantages of Monte Carlo policy evaluation for large problems. 

5.3-4 Monte Carlo ES was introduced in the 1998 edition of this book. That may have been the 
first explicit connection between Monte Carlo estimation and control methods based on policy 
iteration. An early use of Monte Carlo methods to estimate action values in a reinforcement 
learning context was by Michie and Chambers (1968). In pole balancing (page 44), they used 
averages of episode durations to assess the worth (expected balancing “life”) of each possible 
action in each state, and then used these assessments to control action selections. Their method 
is similar in spirit to Monte Carlo ES with every-visit MC estimates. Narendra and Wheeler 
(1986) studied a Monte Carlo method for ergodic finite Markov chains that used the return 
accumulated between successive visits to the same state as a reward for adjusting a learning 
automaton’s action probabilities. 

5.5 Efficient off-policy learning has become recognized as an important challenge that arises in 
several fields. For example, it is closely related to the idea of “interventions” and “counterfactuals” 
in probabilistic graphical (Bayesian) models (e.g., Pearl, 1995; Balke and Pearl, 1994). 
Off-policy methods using importance sampling have a long history and yet still are not well 
understood. Weighted importance sampling, which is also sometimes called normalized importance 
sampling (e.g., Koller and Friedman, 2009), is discussed by Rubinstein (1981), Hesterberg 
(1988), Shelton (2001), and Liu (2001) among others. 

The target policy in off-policy learning is sometimes referred to in the literature as the “estimation” 
policy, as it was in the first edition of this book. 

5.7 The racetrack exercise is adapted from Barto, Bradtke, and Singh (1995), and from Gardner 
(1973). 

5.8 Our treatment of the idea of discounting-aware importance sampling is based on the analysis 
of Sutton, Mahmood, Precup, and van Hasselt (2014). It has been worked out most fully to 
date by Mahmood (in preparation; Mahmood, van Hasselt, and Sutton, 2014). 

5.9 Per-reward importance sampling was introduced by Precup, Sutton, and Singh (2000), who 
called it “per-decision” importance sampling. These works also combine off-policy learning 
with temporal-difference learning, eligibility traces, and approximation methods, introducing 
subtle issues that we consider in later chapters. 



Chapter 6 


Temporal-Difference Learning 


If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be 
temporal-difference (TD) learning. TD learning is a combination of Monte Carlo ideas and dynamic 
programming (DP) ideas. Like Monte Carlo methods, TD methods can learn directly from raw experience 
without a model of the environment’s dynamics. Like DP, TD methods update estimates based 
in part on other learned estimates, without waiting for a final outcome (they bootstrap). The relationship 
between TD, DP, and Monte Carlo methods is a recurring theme in the theory of reinforcement 
learning; this chapter is the beginning of our exploration of it. Before we are done, we will see that 
these ideas and methods blend into each other and can be combined in many ways. In particular, in 
Chapter 7 we introduce n-step algorithms, which provide a bridge from TD to Monte Carlo methods, 
and in Chapter 12 we introduce the TD($\lambda$) algorithm, which seamlessly unifies them. 

As usual, we start by focusing on the policy evaluation or prediction problem, the problem of 
estimating the value function $v_\pi$ for a given policy $\pi$. For the control problem (finding an optimal policy), 
DP, TD, and Monte Carlo methods all use some variation of generalized policy iteration (GPI). The 
differences in the methods are primarily differences in their approaches to the prediction problem. 


6.1 TD Prediction 


Both TD and Monte Carlo methods use experience to solve the prediction problem. Given some 
experience following a policy $\pi$, both methods update their estimate $V$ of $v_\pi$ for the nonterminal states 
$S_t$ occurring in that experience. Roughly speaking, Monte Carlo methods wait until the return following 
the visit is known, then use that return as a target for $V(S_t)$. A simple every-visit Monte Carlo method 
suitable for nonstationary environments is 

$$V(S_t) \leftarrow V(S_t) + \alpha\big[G_t - V(S_t)\big], \tag{6.1}$$


where $G_t$ is the actual return following time $t$, and $\alpha$ is a constant step-size parameter (cf. Equation 
2.4). Let us call this method constant-$\alpha$ MC. Whereas Monte Carlo methods must wait until the end 
of the episode to determine the increment to $V(S_t)$ (only then is $G_t$ known), TD methods need to wait 
only until the next time step. At time $t+1$ they immediately form a target and make a useful update 
using the observed reward $R_{t+1}$ and the estimate $V(S_{t+1})$. The simplest TD method makes the update 

$$V(S_t) \leftarrow V(S_t) + \alpha\big[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big] \tag{6.2}$$


immediately on transition to $S_{t+1}$ and receiving $R_{t+1}$. In effect, the target for the Monte Carlo update 
is $G_t$, whereas the target for the TD update is $R_{t+1} + \gamma V(S_{t+1})$. This TD method is called TD(0), or 
one-step TD, because it is a special case of the TD($\lambda$) and n-step TD methods developed in Chapter 12 
and Chapter 7. The box below specifies TD(0) completely in procedural form. 


Tabular TD(0) for estimating v_π 

Input: the policy π to be evaluated 
Initialize V(s) arbitrarily (e.g., V(s) = 0, for all s ∈ S⁺) 
Repeat (for each episode): 
    Initialize S 
    Repeat (for each step of episode): 
        A ← action given by π for S 
        Take action A, observe R, S' 
        V(S) ← V(S) + α[R + γV(S') − V(S)] 
        S ← S' 
    until S is terminal 
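
The update in the box can be sketched in a few lines of Python. Here an episode is supplied as a list of already-observed transitions rather than generated by interacting with an environment; that representation, and the use of `None` for the terminal state, are simplifying assumptions of ours.

```python
def td0_episode(V, transitions, alpha, gamma=1.0):
    """Apply the tabular TD(0) update along one episode, in place.

    transitions: list of (S, R, S_next) in time order; S_next is None when
    the transition enters the terminal state, whose value is taken to be 0.
    """
    for state, reward, next_state in transitions:
        v_next = 0.0 if next_state is None else V[next_state]
        V[state] += alpha * (reward + gamma * v_next - V[state])
    return V

V = {"A": 0.5, "B": 0.5}
td0_episode(V, [("A", 0.0, "B"), ("B", 1.0, None)], alpha=0.1)
print(V)
```

In this example the first transition leaves $V(\text{A})$ unchanged because the target $0 + V(\text{B}) = 0.5$ already matches the estimate, while the second moves $V(\text{B})$ a tenth of the way from 0.5 toward the target 1.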

Because TD(0) bases its update in part on an existing estimate, we say that it is a bootstrapping 
method, like DP. We know from Chapter 3 that 

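$$\begin{aligned}
v_\pi(s) &= \mathbb{E}_\pi[G_t \mid S_t = s] && (6.3) \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] && \text{(from (3.9))} \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]. && (6.4)
\end{aligned}$$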

Roughly speaking, Monte Carlo methods use an estimate of (6.3) as a target, whereas DP methods use 
an estimate of (6.4) as a target. The Monte Carlo target is an estimate because the expected value 
in (6.3) is not known; a sample return is used in place of the real expected return. The DP target 
is an estimate not because of the expected values, which are assumed to be completely provided by a 
model of the environment, but because $v_\pi(S_{t+1})$ is not known and the current estimate, $V(S_{t+1})$, is 
used instead. The TD target is an estimate for both reasons: it samples the expected values in (6.4) 
and it uses the current estimate $V$ instead of the true $v_\pi$. Thus, TD methods combine the sampling of 
Monte Carlo with the bootstrapping of DP. As we shall see, with care and imagination this can take us 
a long way toward obtaining the advantages of both Monte Carlo and DP methods. 

Shown to the right is the backup diagram for tabular TD(0). The value estimate for 
the state node at the top of the backup diagram is updated on the basis of the one sample 
transition from it to the immediately following state. We refer to TD and Monte Carlo 
updates as sample updates because they involve looking ahead to a sample successor state 
(or state-action pair), using the value of the successor and the reward along the way to 
compute a backed-up value, and then updating the value of the original state (or state- 
action pair) accordingly. Sample updates differ from the expected updates of DP methods 
in that they are based on a single sample successor rather than on a complete distribution 
of all possible successors. 

Finally, note that the quantity in brackets in the TD(0) update is a sort of error, measuring the 
difference between the estimated value of $S_t$ and the better estimate $R_{t+1} + \gamma V(S_{t+1})$. This quantity, 
called the TD error, arises in various forms throughout reinforcement learning: 

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t). \tag{6.5}$$

Notice that the TD error at each time is the error in the estimate made at that time. Because the TD 
error depends on the next state and next reward, it is not actually available until one time step later. 
That is, $\delta_t$ is the error in $V(S_t)$, available at time $t+1$. Also note that if the array $V$ does not change 
during the episode (as it does not in Monte Carlo methods), then the Monte Carlo error can be written 
as a sum of TD errors: 

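$$\begin{aligned}
G_t - V(S_t) &= R_{t+1} + \gamma G_{t+1} - V(S_t) + \gamma V(S_{t+1}) - \gamma V(S_{t+1}) && \text{(from (3.9))} \\
&= \delta_t + \gamma\big(G_{t+1} - V(S_{t+1})\big) \\
&= \delta_t + \gamma\delta_{t+1} + \gamma^2\big(G_{t+2} - V(S_{t+2})\big) \\
&= \delta_t + \gamma\delta_{t+1} + \gamma^2\delta_{t+2} + \cdots + \gamma^{T-t-1}\delta_{T-1} + \gamma^{T-t}\big(G_T - V(S_T)\big) \\
&= \delta_t + \gamma\delta_{t+1} + \gamma^2\delta_{t+2} + \cdots + \gamma^{T-t-1}\delta_{T-1} + \gamma^{T-t}(0 - 0) \\
&= \sum_{k=t}^{T-1} \gamma^{k-t}\delta_k. && (6.6)
\end{aligned}$$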

This identity is not exact if V is updated during the episode (as it is in TD(0)), but if the step size is 
small then it may still hold approximately. Generalizations of this identity play an important role in 
the theory and algorithms of temporal-difference learning. 
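
When $V$ is held fixed, the identity (6.6) can be verified directly. The following sketch uses an arbitrary made-up trajectory and value table; the function names are ours.

```python
def td_errors(V, states, rewards, gamma):
    """delta_k for k = 0..T-1, with states = [S_0..S_T], rewards = [R_1..R_T]."""
    return [rewards[k] + gamma * V[states[k + 1]] - V[states[k]]
            for k in range(len(rewards))]

def mc_error(V, states, rewards, gamma, t):
    """G_t - V(S_t), with G_t computed from the observed rewards."""
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, len(rewards)))
    return G - V[states[t]]

def discounted_td_sum(V, states, rewards, gamma, t):
    """Right-hand side of (6.6): sum over k >= t of gamma^(k-t) * delta_k."""
    deltas = td_errors(V, states, rewards, gamma)
    return sum(gamma ** (k - t) * deltas[k] for k in range(t, len(deltas)))

# Arbitrary illustrative values; the terminal state must have value 0.
V = {"A": 0.3, "B": -1.0, "C": 2.0, "end": 0.0}
states = ["A", "B", "C", "end"]
rewards = [1.0, 0.0, 5.0]
for t in range(3):
    print(t, mc_error(V, states, rewards, 0.9, t),
          discounted_td_sum(V, states, rewards, 0.9, t))
```

The two printed columns agree up to rounding for every $t$. If $V$ were updated between steps, the intermediate $\gamma V(S_{k+1})$ terms would no longer cancel exactly, which is the discrepancy the following exercise asks about.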

Exercise 6.1 If $V$ changes during the episode, then (6.6) only holds approximately; what would the 
difference be between the two sides? Let $V_t$ denote the array of state values used at time $t$ in the TD 
error (6.5) and in the TD update (6.2). Redo the derivation above to determine the additional amount 
that must be added to the sum of TD errors in order to equal the Monte Carlo error. □ 

Example 6.1: Driving Home Each day as you drive home from work, you try to predict how long 
it will take to get home. When you leave your office, you note the time, the day of week, the weather, 
and anything else that might be relevant. Say on this Friday you are leaving at exactly 6 o’clock, and 
you estimate that it will take 30 minutes to get home. As you reach your car it is 6:05, and you notice 
it is starting to rain. Traffic is often slower in the rain, so you reestimate that it will take 35 minutes 
from then, or a total of 40 minutes. Fifteen minutes later you have completed the highway portion of 
your journey in good time. As you exit onto a secondary road you cut your estimate of total travel 
time to 35 minutes. Unfortunately, at this point you get stuck behind a slow truck, and the road is too 
narrow to pass. You end up having to follow the truck until you turn onto the side street where you 
live at 6:40. Three minutes later you are home. The sequence of states, times, and predictions is thus 
as follows: 

                               Elapsed Time    Predicted     Predicted 
State                          (minutes)       Time to Go    Total Time 
leaving office, friday at 6         0              30            30 
reach car, raining                  5              35            40 
exiting highway                    20              15            35 
2ndary road, behind truck          30              10            40 
entering home street               40               3            43 
arrive home                        43               0            43 


The rewards in this example are the elapsed times on each leg of the journey.¹ We are not discounting 
($\gamma = 1$), and thus the return for each state is the actual time to go from that state. The value of each 
state is the expected time to go. The second column of numbers gives the current estimated value for 
each state encountered. 

A simple way to view the operation of Monte Carlo methods is to plot the predicted total time (the 
last column) over the sequence, as in Figure 6.1 (left). The arrows show the changes in predictions 
recommended by the constant-$\alpha$ MC method (6.1), for $\alpha = 1$. These are exactly the errors between 
the estimated value (predicted time to go) in each state and the actual return (actual time to go). For 
example, when you exited the highway you thought it would take only 15 minutes more to get home, 
but in fact it took 23 minutes. Equation 6.1 applies at this point and determines an increment in the 
estimate of time to go after exiting the highway. The error, $G_t - V(S_t)$, at this time is eight minutes. 
Suppose the step-size parameter, $\alpha$, is 1/2. Then the predicted time to go after exiting the highway 
would be revised upward by four minutes as a result of this experience. This is probably too large a 
change in this case; the truck was probably just an unlucky break. In any event, the change can only be 
made off-line, that is, after you have reached home. Only at this point do you know any of the actual 
returns. 

¹ If this were a control problem with the objective of minimizing travel time, then we would of course make the rewards 
the negative of the elapsed time. But since we are concerned here only with prediction (policy evaluation), we can keep 
things simple by using positive numbers. 

[Figure 6.1, plots not reproduced: two panels, each showing predicted total travel time (vertical axis) against 
situation (horizontal axis: leaving office, reach car, exiting highway, 2ndary road, home street, arrive home).] 

Figure 6.1: Changes recommended in the driving home example by Monte Carlo methods (left) and TD 
methods (right). 

Is it necessary to wait until the final outcome is known before learning can begin? Suppose on another 
day you again estimate when leaving your office that it will take 30 minutes to drive home, but then 
you become stuck in a massive traffic jam. Twenty-five minutes after leaving the office you are still 
bumper-to-bumper on the highway. You now estimate that it will take another 25 minutes to get home, 
for a total of 50 minutes. As you wait in traffic, you already know that your initial estimate of 30 
minutes was too optimistic. Must you wait until you get home before increasing your estimate for the 
initial state? According to the Monte Carlo approach you must, because you don’t yet know the true 
return. 

According to a TD approach, on the other hand, you would learn immediately, shifting your initial 
estimate from 30 minutes toward 50. In fact, each estimate would be shifted toward the estimate that 
immediately follows it. Returning to our first day of driving, Figure 6.1 (right) shows the changes in 
the predictions recommended by the TD rule (6.2) (these are the changes made by the rule if $\alpha = 1$). 
Each error is proportional to the change over time of the prediction, that is, to the temporal differences 
in predictions. 

Besides giving you something to do while waiting in traffic, there are several computational reasons 
why it is advantageous to learn based on your current predictions rather than waiting until termination 
when you know the actual return. We briefly discuss some of these in the next section. ■ 
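The errors discussed in this example can be checked with a few lines of arithmetic. The following is a minimal sketch that only uses the numbers from the table above; the variable names are illustrative and not taken from the text. It computes the Monte Carlo error $G_t - V(S_t)$ and the TD error $R_{t+1} + V(S_{t+1}) - V(S_t)$ for each nonterminal state, with $\gamma = 1$ and the estimates held fixed during the episode.

```python
# Data from the driving-home table (all times in minutes).
states = ["leaving office", "reach car", "exiting highway",
          "2ndary road", "entering home street", "arrive home"]
elapsed = [0, 5, 20, 30, 40, 43]
time_to_go = [30, 35, 15, 10, 3, 0]   # current value estimates V(S_t)

n = len(elapsed) - 1                  # number of nonterminal states

# Reward on each leg is the time spent on that leg (undiscounted).
rewards = [elapsed[t + 1] - elapsed[t] for t in range(n)]

# Actual return from each state: the actual remaining time to get home.
returns = [elapsed[-1] - elapsed[t] for t in range(n)]

# Monte Carlo error: G_t - V(S_t).
mc_errors = [returns[t] - time_to_go[t] for t in range(n)]

# TD error: R_{t+1} + V(S_{t+1}) - V(S_t).
td_errors = [rewards[t] + time_to_go[t + 1] - time_to_go[t] for t in range(n)]

for t in range(n):
    print(f"{states[t]:22s} MC error {mc_errors[t]:+3d}   TD error {td_errors[t]:+3d}")
```

The Monte Carlo error on exiting the highway comes out as 8 minutes, as stated in the text. Because the estimates are not changed during the episode, the TD errors from any state onward add up to the Monte Carlo error for that state.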

Exercise 6.2 This is an exercise to help develop your intuition about why TD methods are often more 
efficient than Monte Carlo methods. Consider the driving home example and how it is addressed by 
TD and Monte Carlo methods. Can you imagine a scenario in which a TD update would be better on 
average than a Monte Carlo update? Give an example scenario—a description of past experience and 
a current state—in which you would expect the TD update to be better. Here’s a hint: Suppose you 
have lots of experience driving home from work. Then you move to a new building and a new parking 
lot (but you still enter the highway at the same place). Now you are starting to learn predictions for 
the new building. Can you see why TD updates are likely to be much better, at least initially, in this 
case? Might the same sort of thing happen in the original task? □ 


6.2 Advantages of TD Prediction Methods 

TD methods update their estimates based in part on other estimates. They learn a guess from a guess— 
they bootstrap. Is this a good thing to do? What advantages do TD methods have over Monte Carlo 
and DP methods? Developing and answering such questions will take the rest of this book and more. 
In this section we briefly anticipate some of the answers. 

Obviously, TD methods have an advantage over DP methods in that they do not require a model of 
the environment, of its reward and next-state probability distributions. 

The next most obvious advantage of TD methods over Monte Carlo methods is that they are naturally 
implemented in an on-line, fully incremental fashion. With Monte Carlo methods one must wait until 
the end of an episode, because only then is the return known, whereas with TD methods one need wait 
only one time step. Surprisingly often this turns out to be a critical consideration. Some applications 
have very long episodes, so that delaying all learning until the end of the episode is too slow. Other 
applications are continuing tasks and have no episodes at all. Finally, as we noted in the previous 
chapter, some Monte Carlo methods must ignore or discount episodes on which experimental actions 
are taken, which can greatly slow learning. TD methods are much less susceptible to these problems 
because they learn from each transition regardless of what subsequent actions are taken. 

But are TD methods sound? Certainly it is convenient to learn one guess from the next, without 
waiting for an actual outcome, but can we still guarantee convergence to the correct answer? Happily, 
the answer is yes. For any fixed policy $\pi$, TD(0) has been proved to converge to $v_\pi$, in the mean for a 
constant step-size parameter if it is sufficiently small, and with probability 1 if the step-size parameter 
decreases according to the usual stochastic approximation conditions (2.7). Most convergence proofs 
apply only to the table-based case of the algorithm presented above (6.2), but some also apply to the 
case of general linear function approximation. These results are discussed in a more general setting in 
Chapter 9. 

If both TD and Monte Carlo methods converge asymptotically to the correct predictions, then a 
natural next question is “Which gets there first?” In other words, which method learns faster? Which 
makes the more efficient use of limited data? At the current time this is an open question in the sense 
that no one has been able to prove mathematically that one method converges faster than the other. In 
fact, it is not even clear what is the most appropriate formal way to phrase this question! In practice, 
however, TD methods have usually been found to converge faster than constant-$\alpha$ MC methods on 
stochastic tasks, as illustrated in Example 6.2. 

Exercise 6.3 From the results shown in the left graph of the random walk example (on the next page) 
it appears that the first episode results in a change in only V(A). What does this tell you about what 
happened on the first episode? Why was only the estimate for this one state changed? By exactly how 
much was it changed? □ 

Exercise 6.4 The specific results shown in the right graph of the random walk example are dependent 
on the value of the step-size parameter, $\alpha$. Do you think the conclusions about which algorithm is 
better would be affected if a wider range of $\alpha$ values were used? Is there a different, fixed value of $\alpha$ at 
which either algorithm would have performed significantly better than shown? Why or why not? □ 

*Exercise 6.5 In the right graph of the random walk example, the RMS error of the TD method seems 
to go down and then up again, particularly at high $\alpha$’s. What could have caused this? Do you think 
this always occurs, or might it be a function of how the approximate value function was initialized? □ 

Example 6.2 Random Walk 


In this example we empirically compare the prediction abilities of TD(0) and constant-$\alpha$ MC 
when applied to the following Markov reward process: 



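[Diagram not reproduced: the nonterminal states A, B, C, D, E in a row with a terminal state at each end; 
episodes start in C; every transition has reward 0 except termination on the right, which has reward 1.] 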


A Markov reward process, or MRP, is a Markov decision process without actions. We will often 
use MRPs when focusing on the prediction problem, in which there is no need to distinguish 
the dynamics due to the environment from those due to the agent. In this MRP, all episodes 
start in the center state, C, then proceed either left or right by one state on each step, with 
equal probability. Episodes terminate either on the extreme left or the extreme right. When an 
episode terminates on the right, a reward of +1 occurs; all other rewards are zero. For example, a 
typical episode might consist of the following state-and-reward sequence: C, 0, B, 0, C, 0, D, 0, E, 1. 
Because this task is undiscounted, the true value of each state is the probability of terminating 
on the right if starting from that state. Thus, the true value of the center state is $v_\pi(C) = 0.5$. 
The true values of all the states, A through E, are $\frac{1}{6}$, $\frac{2}{6}$, $\frac{3}{6}$, $\frac{4}{6}$, and $\frac{5}{6}$. 




The left graph above shows the values learned after various numbers of episodes on a single 
run of TD(0). The estimates after 100 episodes are about as close as they ever come to the 
true values; with a constant step-size parameter ($\alpha = 0.1$ in this example), the values fluctuate 
indefinitely in response to the outcomes of the most recent episodes. The right graph shows 
learning curves for the two methods for various values of $\alpha$. The performance measure shown 
is the root mean-squared (RMS) error between the value function learned and the true value 
function, averaged over the five states, then averaged over 100 runs. In all cases the approximate 
value function was initialized to the intermediate value V(s) = 0.5, for all s. The TD method 
was consistently better than the MC method on this task. 
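The comparison in this example can be reproduced in outline with a short simulation. The sketch below makes several assumptions that are not in the text: the function names are illustrative, the Monte Carlo update is applied every-visit, and the random seed handling is arbitrary. It implements the tabular TD(0) update (6.2) and the constant-$\alpha$ MC update (6.1) on the five-state random walk, with $\gamma = 1$ and all estimates initialized to 0.5.

```python
import random

STATES = ["A", "B", "C", "D", "E"]
TRUE_V = {s: (i + 1) / 6 for i, s in enumerate(STATES)}

def generate_episode(rng):
    """Return a list of (state, reward, next_state); next_state is None at termination."""
    i = 2  # every episode starts in the center state C
    steps = []
    while True:
        j = i + rng.choice((-1, 1))
        if j < 0:                       # terminated on the left, reward 0
            steps.append((STATES[i], 0.0, None))
            return steps
        if j > 4:                       # terminated on the right, reward +1
            steps.append((STATES[i], 1.0, None))
            return steps
        steps.append((STATES[i], 0.0, STATES[j]))
        i = j

def td0_update(V, episode, alpha):
    """Tabular TD(0): update after every transition, bootstrapping from V[next]."""
    for s, r, s_next in episode:
        v_next = 0.0 if s_next is None else V[s_next]
        V[s] += alpha * (r + v_next - V[s])

def mc_update(V, episode, alpha):
    """Every-visit constant-alpha MC: update toward the return once the episode ends."""
    g = sum(r for _, r, _ in episode)   # only the final reward can be nonzero
    for s, _, _ in episode:
        V[s] += alpha * (g - V[s])

def rms_error(V):
    """Root mean-squared error against the true values, averaged over the five states."""
    return (sum((V[s] - TRUE_V[s]) ** 2 for s in STATES) / len(STATES)) ** 0.5

def run(update, alpha, n_episodes, seed):
    rng = random.Random(seed)
    V = {s: 0.5 for s in STATES}
    for _ in range(n_episodes):
        update(V, generate_episode(rng), alpha)
    return V

if __name__ == "__main__":
    for name, update in (("TD(0)", td0_update), ("MC", mc_update)):
        errors = [rms_error(run(update, 0.1, 100, seed)) for seed in range(100)]
        print(name, sum(errors) / len(errors))
```

Averaging `rms_error` over many seeds after each episode, for several values of `alpha`, gives learning curves of the kind described for the right graph.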


Exercise 6.6 In Example 6.2 we stated that the true values for the random walk example are $\frac{1}{6}$, $\frac{2}{6}$, $\frac{3}{6}$, $\frac{4}{6}$, 
and $\frac{5}{6}$, for states A through E. Describe at least two different ways that these could have been computed. 
Which would you guess we actually used? Why? □ 


6.3 Optimality of TD(0) 

Suppose there is available only a finite amount of experience, say 10 episodes or 100 time steps. In this 
case, a common approach with incremental learning methods is to present the experience repeatedly 
until the method converges upon an answer. Given an approximate value function, V, the increments 
specified by (6.1) or (6.2) are computed for every time step t at which a nonterminal state is visited, 
but the value function is changed only once, by the sum of all the increments. Then all the available 
experience is processed again with the new value function to produce a new overall increment, and so 
on, until the value function converges. We call this batch updating because updates are made only after 
processing each complete batch of training data. 

Under batch updating, TD(0) converges deterministically to a single answer independent of the 
step-size parameter, $\alpha$, as long as $\alpha$ is chosen to be sufficiently small. The constant-$\alpha$ MC method also 
converges deterministically under the same conditions, but to a different answer. Understanding these 
two answers will help us understand the difference between the two methods. Under normal updating 
the methods do not move all the way to their respective batch answers, but in some sense they take 
steps in these directions. Before trying to understand the two answers in general, for all possible tasks, 
we first look at a few examples. 

Example 6.3: Random walk under batch updating Batch-updating versions of TD(0) and 
constant-$\alpha$ MC were applied as follows to the random walk prediction example (Example 6.2). After 
each new episode, all episodes seen so far were treated as a batch. They were repeatedly presented to the 
algorithm, either TD(0) or constant-$\alpha$ MC, with $\alpha$ sufficiently small that the value function converged. 
The resulting value function was then compared with $v_\pi$, and the average root mean-squared error 
across the five states (and across 100 independent repetitions of the whole experiment) was plotted to 
obtain the learning curves shown in Figure 6.2. Note that the batch TD method was consistently better 
than the batch Monte Carlo method. 



Figure 6.2: Performance of TD(0) and constant-$\alpha$ MC under batch training on the random walk task. ■ 

Under batch training, constant-$\alpha$ MC converges to values, $V(s)$, that are sample averages of the 
actual returns experienced after visiting each state s. These are optimal estimates in the sense that 
they minimize the mean-squared error from the actual returns in the training set. In this sense it is 
surprising that the batch TD method was able to perform better according to the root mean-squared 
error measure shown in Figure 6.2. How is it that batch TD was able to perform better than this 
optimal method? The answer is that the Monte Carlo method is optimal only in a limited way, and 
that TD is optimal in a way that is more relevant to predicting returns. But first let’s develop our 
intuitions about different kinds of optimality through another example. Consider Example 6.4. 


Example 6.4 You are the Predictor 


Place yourself now in the role of the predictor of returns for an unknown Markov reward 
process. Suppose you observe the following eight episodes: 


A, 0, B, 0 

B, 1 

B, 1 

B, 1 

B, 1 

B, 1 

B, 1 

B, 0 


This means that the first episode started in state A, transitioned to B with a reward of 0, 
and then terminated from B with a reward of 0. The other seven episodes were even shorter, 
starting from B and terminating immediately. Given this batch of data, what would you say 
are the optimal predictions, the best values for the estimates V(A) and V(B)? Everyone would 
probably agree that the optimal value for V(B) is $\frac{3}{4}$, because six out of the eight times in state B 
the process terminated immediately with a return of 1, and the other two times in B the process 
terminated immediately with a return of 0. 

But what is the optimal value for the estimate V(A) given this 
data? Here there are two reasonable answers. One is to observe 
that 100% of the times the process was in state A it traversed 
immediately to B (with a reward of 0); and since we have already 
decided that B has value $\frac{3}{4}$, therefore A must have value $\frac{3}{4}$ as 
well. One way of viewing this answer is that it is based on first 
modeling the Markov process, in this case as shown to the right, 
and then computing the correct estimates given the model, which 
indeed in this case gives V(A) = $\frac{3}{4}$. This is also the answer that 
batch TD(0) gives. 

The other reasonable answer is simply to observe that we have seen A once and the return 
that followed it was 0; we therefore estimate V(A) as 0. This is the answer that batch Monte 
Carlo methods give. Notice that it is also the answer that gives minimum squared error on 
the training data. In fact, it gives zero error on the data. But still we expect the first answer 
to be better. If the process is Markov, we expect that the first answer will produce lower 
error on future data, even though the Monte Carlo answer is better on the existing data. 
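Both answers can be checked numerically by applying batch updating to the eight episodes. The following is a minimal sketch; the function names, the step size of 0.05, the initial values of 0.5, and the number of sweeps are assumptions chosen for illustration. Each sweep sums the increments from (6.2) or (6.1) over the whole batch and only then changes the value function, as described at the start of this section.

```python
# The eight episodes of Example 6.4 as lists of (state, reward) pairs.
EPISODES = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

def batch_increments_td(V, episodes, alpha):
    """Sum of TD(0) increments over the batch (gamma = 1, terminal value 0)."""
    inc = {s: 0.0 for s in V}
    for ep in episodes:
        for t, (s, r) in enumerate(ep):
            v_next = V[ep[t + 1][0]] if t + 1 < len(ep) else 0.0
            inc[s] += alpha * (r + v_next - V[s])
    return inc

def batch_increments_mc(V, episodes, alpha):
    """Sum of constant-alpha MC increments over the batch."""
    inc = {s: 0.0 for s in V}
    for ep in episodes:
        for t, (s, _) in enumerate(ep):
            g = sum(r for _, r in ep[t:])      # return following time t
            inc[s] += alpha * (g - V[s])
    return inc

def batch_solve(increments, episodes, alpha=0.05, sweeps=2000):
    """Present the batch repeatedly, changing V once per sweep."""
    V = {"A": 0.5, "B": 0.5}
    for _ in range(sweeps):
        inc = increments(V, episodes, alpha)
        for s in V:
            V[s] += inc[s]
    return V

V_td = batch_solve(batch_increments_td, EPISODES)
V_mc = batch_solve(batch_increments_mc, EPISODES)
print("batch TD(0):", V_td)
print("batch MC:   ", V_mc)
```

Under these assumptions batch TD(0) settles at V(A) = V(B) = 3/4, while batch Monte Carlo settles at V(A) = 0 and V(B) = 3/4, matching the two answers discussed in the example.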


Example 6.4 illustrates a general difference between the estimates found by batch TD(0) and batch 
Monte Carlo methods. Batch Monte Carlo methods always find the estimates that minimize mean- 
squared error on the training set, whereas batch TD(0) always finds the estimates that would be 
exactly correct for the maximum-likelihood model of the Markov process. In general, the maximum- 
likelihood estimate of a parameter is the parameter value whose probability of generating the data is 
greatest. In this case, the maximum-likelihood estimate is the model of the Markov process formed 
in the obvious way from the observed episodes: the estimated transition probability from i to j is 
the fraction of observed transitions from i that went to j, and the associated expected reward is the 
average of the rewards observed on those transitions. Given this model, we can compute the estimate 
of the value function that would be exactly correct if the model were exactly correct. This is called the 
certainty-equivalence estimate because it is equivalent to assuming that the estimate of the underlying 
process was known with certainty rather than being approximated. In general, batch TD(0) converges 
to the certainty-equivalence estimate. 

This helps explain why TD methods converge more quickly than Monte Carlo methods. In batch 
form, TD(0) is faster than Monte Carlo methods because it computes the true certainty-equivalence 
estimate. This explains the advantage of TD(0) shown in the batch results on the random walk task 
(Figure 6.2). The relationship to the certainty-equivalence estimate may also explain in part the speed 
advantage of nonbatch TD(0) (e.g., Example 6.2, right graph). Although the nonbatch methods 
do not achieve either the certainty-equivalence or the minimum squared-error estimates, they can be 
understood as moving roughly in these directions. Nonbatch TD(0) may be faster than constant-$\alpha$ MC 
because it is moving toward a better estimate, even though it is not getting all the way there. At the 
current time nothing more definite can be said about the relative efficiency of on-line TD and Monte 
Carlo methods. 

Finally, it is worth noting that although the certainty-equivalence estimate is in some sense an 
optimal solution, it is almost never feasible to compute it directly. If N is the number of states, then 
just forming the maximum-likelihood estimate of the process may require $N^2$ memory, and computing 
the corresponding value function requires on the order of $N^3$ computational steps if done conventionally. 
In these terms it is indeed striking that TD methods can approximate the same solution using memory 
no more than N and repeated computations over the training set. On tasks with large state spaces, 
TD methods may be the only feasible way of approximating the certainty-equivalence solution. 
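For a small problem the certainty-equivalence estimate can be computed directly, which makes the contrast with the TD approach concrete. The sketch below is an illustration under stated assumptions: it uses the eight episodes of Example 6.4, the names `fit_model`, `evaluate`, and the terminal label `"T"` are hypothetical, and the value function is obtained by repeated sweeps over the estimated model rather than by the conventional direct linear solve mentioned above.

```python
from collections import defaultdict

# The eight episodes of Example 6.4 as lists of (state, reward) pairs.
EPISODES = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]
TERMINAL = "T"  # hypothetical label for the terminal state

def fit_model(episodes):
    """Maximum-likelihood model: for each observed transition s -> s2, the
    fraction of transitions from s that went to s2 and the average reward."""
    counts = defaultdict(lambda: defaultdict(int))
    reward_sums = defaultdict(lambda: defaultdict(float))
    for ep in episodes:
        for t, (s, r) in enumerate(ep):
            s_next = ep[t + 1][0] if t + 1 < len(ep) else TERMINAL
            counts[s][s_next] += 1
            reward_sums[s][s_next] += r
    model = {}
    for s, nexts in counts.items():
        total = sum(nexts.values())
        model[s] = {s2: (n / total, reward_sums[s][s2] / n)
                    for s2, n in nexts.items()}
    return model

def evaluate(model, sweeps=1000):
    """Value function that is exactly correct for the estimated model (gamma = 1)."""
    V = {s: 0.0 for s in model}
    for _ in range(sweeps):
        for s, nexts in model.items():
            V[s] = sum(p * (r + (0.0 if s2 == TERMINAL else V[s2]))
                       for s2, (p, r) in nexts.items())
    return V

model = fit_model(EPISODES)
V_ce = evaluate(model)
print(model)
print(V_ce)
```

The model stores one entry per observed transition, which in the worst case grows with the square of the number of states; the tabular TD methods above need only one number per state.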

*Exercise 6.7 Design an off-policy version of the TD(0) update that can be used with arbitrary target 
policy $\pi$ and covering behavior policy $b$, using at each step $t$ the importance sampling ratio $\rho_{t:t}$ (5.3). □ 


6.4 Sarsa: On-policy TD Control 


We turn now to the use of TD prediction methods for the control problem. As usual, we follow the 
pattern of generalized policy iteration (GPI), only this time using TD methods for the evaluation 
or prediction part. As with Monte Carlo methods, we face the need to trade off exploration and 
exploitation, and again approaches fall into two main classes: on-policy and off-policy. In this section 
we present an on-policy TD control method. 

The first step is to learn an action-value function rather than a state-value function. In particular, 
for an on-policy method we must estimate $q_\pi(s, a)$ for the current behavior policy $\pi$ and for all states 
$s$ and actions $a$. This can be done using essentially the same TD method described above for learning 
$v_\pi$. Recall that an episode consists of an alternating sequence of states and state-action pairs: 

$$\ldots,\ S_t,\ A_t,\ R_{t+1},\ S_{t+1},\ A_{t+1},\ R_{t+2},\ S_{t+2},\ A_{t+2},\ \ldots$$


In the previous section we considered transitions from state to state and learned the values of states. 
Now we consider transitions from state-action pair to state-action pair, and learn the values of state- 
action pairs. Formally these cases are identical: they are both Markov chains with a reward process. 
The theorems assuring the convergence of state values under TD(0) also apply to the corresponding 
algorithm for action values: 


$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Big[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \Big]. \tag{6.7}$$


This update is done after every transition from a nonterminal state $S_t$. If $S_{t+1}$ is terminal, 
then $Q(S_{t+1}, A_{t+1})$ is defined as zero. This rule uses every element of the quintuple of events, 
$(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$, that make up a transition from one state-action pair to the next. 
This quintuple gives rise to the name Sarsa for the algorithm. The backup diagram for Sarsa 
is as shown to the right. 


Exercise 6.8 Show that an action-value version of (6.6) holds for the action-value form of the TD 
error $\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)$, again assuming that the values don’t change from step to 
step. □ 

It is straightforward to design an on-policy control algorithm based on the Sarsa prediction method. 
As in all on-policy methods, we continually estimate $q_\pi$ for the behavior policy $\pi$, and at the same time 
change $\pi$ toward greediness with respect to $q_\pi$. The general form of the Sarsa control algorithm is given 
in the box below. 



The convergence properties of the Sarsa algorithm depend on the nature of the policy’s dependence 
on $Q$. For example, one could use $\varepsilon$-greedy or $\varepsilon$-soft policies. Sarsa converges with probability 1 to an 
optimal policy and action-value function as long as all state-action pairs are visited an infinite number 
of times and the policy converges in the limit to the greedy policy (which can be arranged, for example, 
with $\varepsilon$-greedy policies by setting $\varepsilon = 1/t$). 

Example 6.5: Windy Gridworld Shown inset in Figure 6.3 is a standard gridworld, with start and 
goal states, but with one difference: there is a crosswind upward through the middle of the grid. The 
actions are the standard four (up, down, right, and left), but in the middle region the resultant 
next states are shifted upward by a “wind,” the strength of which varies from column to column. The 
strength of the wind is given below each column, in number of cells shifted upward. For example, if 
you are one cell to the right of the goal, then the action left takes you to the cell just above the goal. 
Let us treat this as an undiscounted episodic task, with constant rewards of $-1$ until the goal state is 
reached. 


[Figure 6.3, plot not reproduced: episodes completed (vertical axis) against time steps (horizontal axis), 
with the windy gridworld shown inset.] 

Figure 6.3: Results of Sarsa applied to a gridworld (shown inset) in which movement is altered by a 
location-dependent, upward “wind.” A trajectory under the optimal policy is also shown. 

The graph in Figure 6.3 shows the results of applying $\varepsilon$-greedy Sarsa to this task, with $\varepsilon = 0.1$, 
$\alpha = 0.5$, and the initial values $Q(s, a) = 0$ for all $s, a$. The increasing slope of the graph shows 
that the goal is reached more and more quickly over time. By 8000 time steps, the greedy policy 
was long since optimal (a trajectory from it is shown inset); continued $\varepsilon$-greedy exploration kept the 
average episode length at about 17 steps, two more than the minimum of 15. Note that Monte Carlo 
methods cannot easily be used on this task because termination is not guaranteed for all policies. 
If a policy was ever found that caused the agent to stay in the same state, then the next episode 
would never end. Step-by-step learning methods such as Sarsa do not have this problem because 
they quickly learn during the episode that such policies are poor, and switch to something else. 
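A minimal sketch of $\varepsilon$-greedy Sarsa on this task follows, built around the update (6.7). The grid layout is an assumption: the dimensions (7 rows by 10 columns), the start and goal cells, and the per-column wind strengths are the values commonly used for this example and are not given in the text above, although they are consistent with the stated minimum of 15 steps and with the move described from the cell to the right of the goal. The wind of the column being left is applied to each move. Function names are illustrative.

```python
import random

ROWS, COLS = 7, 10
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]    # upward shift per column (assumed values)
START, GOAL = (3, 0), (3, 7)              # (row, col), row 0 at the top (assumed)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "right": (0, 1), "left": (0, -1)}

def step(state, action):
    """Move, apply the wind of the departure column, and clip to the grid."""
    row, col = state
    d_row, d_col = ACTIONS[action]
    new_row = min(max(row + d_row - WIND[col], 0), ROWS - 1)
    new_col = min(max(col + d_col, 0), COLS - 1)
    next_state = (new_row, new_col)
    return next_state, -1.0, next_state == GOAL

def epsilon_greedy(Q, state, epsilon, rng):
    """Random action with probability epsilon, otherwise greedy with random tie-breaking."""
    if rng.random() < epsilon:
        return rng.choice(list(ACTIONS))
    best = max(Q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(state, a)] == best])

def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha, gamma=1.0):
    """One application of the Sarsa update; Q of a terminal state counts as zero."""
    target = r if done else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def run_sarsa(n_steps, alpha=0.5, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = {((r, c), a): 0.0 for r in range(ROWS) for c in range(COLS) for a in ACTIONS}
    episodes_done = 0
    s = START
    a = epsilon_greedy(Q, s, epsilon, rng)
    for _ in range(n_steps):
        s_next, r, done = step(s, a)
        a_next = epsilon_greedy(Q, s_next, epsilon, rng)
        sarsa_update(Q, s, a, r, s_next, a_next, done, alpha)
        if done:                          # start a new episode
            episodes_done += 1
            s = START
            a = epsilon_greedy(Q, s, epsilon, rng)
        else:
            s, a = s_next, a_next
    return Q, episodes_done

if __name__ == "__main__":
    Q, n_episodes = run_sarsa(8000)
    print("episodes completed in 8000 steps:", n_episodes)
```

Because learning happens on every step, a policy that keeps the agent in place is corrected within the episode: each wasted step lowers the value of the action that caused it.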


Exercise 6.9: Windy Gridworld with King’s Moves Re-solve the windy gridworld task assuming eight 
possible actions, including the diagonal moves, rather than the usual four. How much better can you do 
with the extra actions? Can you do even better by including a ninth action that causes no movement 
at all other than that caused by the wind? □ 

Exercise 6.10: Stochastic Wind Re-solve the windy gridworld task with King’s moves, assuming 
that the effect of the wind, if there is any, is stochastic, sometimes varying by 1 from the mean values 
given for each column. That is, a third of the time you move exactly according to these values, as in 
the previous exercise, but also a third of the time you move one cell above that, and another third of 
the time you move one cell below that. For example, if you are one cell to the right of the goal and 
you move left, then one-third of the time you move one cell above the goal, one-third of the time you 
move two cells above the goal, and one-third of the time you move to the goal. □ 


6.5 Q-learning: Off-policy TD Control 


One of the early breakthroughs in reinforcement learning was the development of an off-policy TD 
control algorithm known as Q-learning (Watkins, 1989), defined by 


$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Big[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \Big]. \tag{6.8}$$


In this case, the learned action-value function, $Q$, directly approximates $q_*$, the optimal action-value 
function, independent of the policy being followed. This dramatically simplifies the analysis of the 
algorithm and enabled early convergence proofs. The policy still has an effect in that it determines 
which state-action pairs are visited and updated. However, all that is required for correct convergence 
is that all pairs continue to be updated. As we observed in Chapter 5, this is a minimal requirement 
in the sense that any method guaranteed to find optimal behavior in the general case must require it. 
Under this assumption and a variant of the usual stochastic approximation conditions on the sequence 
of step-size parameters, $Q$ has been shown to converge with probability 1 to $q_*$. 


Q-learning (off-policy TD control) for estimating $\pi \approx \pi_*$ 

Initialize $Q(s, a)$, for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$, arbitrarily, and $Q(\text{terminal-state}, \cdot) = 0$ 
Repeat (for each episode): 
    Initialize $S$ 
    Repeat (for each step of episode): 
        Choose $A$ from $S$ using policy derived from $Q$ (e.g., $\varepsilon$-greedy) 
        Take action $A$, observe $R$, $S'$ 
        $Q(S, A) \leftarrow Q(S, A) + \alpha [R + \gamma \max_a Q(S', a) - Q(S, A)]$ 
        $S \leftarrow S'$ 
    until $S$ is terminal 
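The boxed algorithm translates almost line for line into code. The following is a sketch rather than a reference implementation: the `reset`/`step` interface, the function names, and the four-cell corridor used to exercise it are all hypothetical and chosen only to keep the example self-contained. The corridor has cells 0 to 3, starts in cell 0, terminates in cell 3, and gives a reward of $-1$ on every step.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon, rng):
    """Random action with probability epsilon, otherwise greedy with random tie-breaking."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    best = max(Q[(state, a)] for a in actions)
    return rng.choice([a for a in actions if Q[(state, a)] == best])

def q_learning(reset, step, actions, n_episodes,
               alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)          # all values start at 0; terminal values stay 0
    for _ in range(n_episodes):
        s = reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, actions, epsilon, rng)
            s_next, r, done = step(s, a)
            # The target uses the maximum over next actions, not the action
            # that the behavior policy will actually take next.
            best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q

# Hypothetical toy task used only to run the algorithm.
ACTIONS = ["left", "right"]

def corridor_reset():
    return 0

def corridor_step(state, action):
    nxt = max(state - 1, 0) if action == "left" else state + 1
    return nxt, -1.0, nxt == 3

Q = q_learning(corridor_reset, corridor_step, ACTIONS, n_episodes=500)
for s in range(3):
    print(s, {a: round(Q[(s, a)], 3) for a in ACTIONS})
```

On this deterministic corridor the learned values for moving right approach the negated number of steps remaining to the terminal cell, even though the behavior policy continues to explore.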







What is the backup diagram for Q-learning? The rule (6.8) updates a state-action pair, so the top 
node, the root of the update, must be a small, filled action node. The update is also from action nodes, 
maximizing over all those actions possible in the next state. Thus the bottom nodes of the backup 
diagram should be all these action nodes. Finally, remember that we indicate taking the maximum of 
these “next action” nodes with an arc across them (Figure 3.5-right). Can you guess now what the 
diagram is? If so, please do make a guess before turning to the answer in Figure 6.5. 

Example 6.6: Cliff Walking This gridworld example compares Sarsa and Q-learning, highlighting 
the difference between on-policy (Sarsa) and off-policy (Q-learning) methods. Consider the gridworld 
shown in the upper part of Figure 6.4. This is a standard undiscounted, episodic task, with start and 
goal states, and the usual actions causing movement up, down, right, and left. Reward is $-1$ on all 
transitions except those into the region marked “The Cliff.” Stepping into this region incurs a reward 
of $-100$ and sends the agent instantly back to the start. 

The lower part of Figure 6.4 shows the performance of the Sarsa and Q-learning methods with 
$\varepsilon$-greedy action selection, $\varepsilon = 0.1$. After an initial transient, Q-learning learns values for the optimal 
policy, that which travels right along the edge of the cliff. Unfortunately, this results in its occasionally 
falling off the cliff because of the $\varepsilon$-greedy action selection. Sarsa, on the other hand, takes the action 
selection into account and learns the longer but safer path through the upper part of the grid. Although 
Q-learning actually learns the values of the optimal policy, its on-line performance is worse than that of 
Sarsa, which learns the roundabout policy. Of course, if $\varepsilon$ were gradually reduced, then both methods 
would asymptotically converge to the optimal policy. 



Figure 6.4: The cliff-walking task. The results are from a single run, but smoothed by averaging the reward 
sums from 10 successive episodes. ■ 


Exercise 6.11 Why is Q-learning considered an off-policy control method? □ 


Figure 6.5: The backup diagrams for Q-learning and expected Sarsa. 


6.6 Expected Sarsa 


Consider the learning algorithm that is just like Q-learning except that instead of the maximum over 
next state-action pairs it uses the expected value, taking into account how likely each action is under 
the current policy. That is, consider the algorithm with the update rule 


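$$\begin{aligned} Q(S_t, A_t) &\leftarrow Q(S_t, A_t) + \alpha \Big[ R_{t+1} + \gamma\, \mathbb{E}[Q(S_{t+1}, A_{t+1}) \mid S_{t+1}] - Q(S_t, A_t) \Big] \\ &\leftarrow Q(S_t, A_t) + \alpha \Big[ R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) Q(S_{t+1}, a) - Q(S_t, A_t) \Big], \end{aligned} \tag{6.9}$$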


but that otherwise follows the schema of Q-learning. Given the next state, $S_{t+1}$, this algorithm moves 
deterministically in the same direction as Sarsa moves in expectation, and accordingly it is called 
Expected Sarsa. Its backup diagram is shown on the right in Figure 6.5. 

Expected Sarsa is more complex computationally than Sarsa but, in return, it eliminates the variance 
due to the random selection of $A_{t+1}$. Given the same amount of experience we might expect it to 
perform slightly better than Sarsa, and indeed it generally does. Figure 6.6 shows summary results 
on the cliff-walking task with Expected Sarsa compared to Sarsa and Q-learning. Expected Sarsa 
retains the significant advantage of Sarsa over Q-learning on this problem. In addition, Expected Sarsa 
shows a significant improvement over Sarsa over a wide range of values for the step-size parameter $\alpha$. 
In cliff walking the state transitions are all deterministic and all randomness comes from the policy. 
In such cases, Expected Sarsa can safely set $\alpha = 1$ without suffering any degradation of asymptotic 
performance, whereas Sarsa can only perform well in the long run at a small value of $\alpha$, at which 
short-term performance is poor. In this and other examples there is a consistent empirical advantage 
of Expected Sarsa over Sarsa. 

Figure 6.6: Interim and asymptotic performance of TD control methods on the cliff-walking task as a function 
of $\alpha$. All algorithms used an $\varepsilon$-greedy policy with $\varepsilon = 0.1$. Asymptotic performance is an average over 100,000 
episodes whereas interim performance is an average over the first 100 episodes. These data are averages of over 
50,000 and 10 runs for the interim and asymptotic cases respectively. The solid circles mark the best interim 
performance of each method. Adapted from van Seijen et al. (2009). 

In these cliff walking results Expected Sarsa was used on-policy, but in general it might use a policy 
different from the target policy $\pi$ to generate behavior, in which case it becomes an off-policy 
algorithm. For example, suppose $\pi$ is the greedy policy while behavior is more exploratory; then Expected 
Sarsa is exactly Q-learning. In this sense Expected Sarsa subsumes and generalizes Q-learning while 
reliably improving over Sarsa. Except for the small additional computational cost, Expected Sarsa may 
completely dominate both of the other more-well-known TD control algorithms. 
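The relationship between the two update targets is easy to see in code. The sketch below assumes an $\varepsilon$-greedy target policy for Expected Sarsa; the function names and the example action values are hypothetical. It computes the bracketed target of (6.9) as a probability-weighted sum and compares it with the Q-learning target of (6.8).

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities under an epsilon-greedy policy; tied greedy
    actions share the greedy probability mass equally."""
    n = len(q_values)
    best = max(q_values)
    n_best = sum(1 for q in q_values if q == best)
    return [epsilon / n + ((1.0 - epsilon) / n_best if q == best else 0.0)
            for q in q_values]

def expected_sarsa_target(reward, next_q_values, policy_probs, gamma=1.0):
    """R + gamma * sum_a pi(a | S') Q(S', a)."""
    expected = sum(p * q for p, q in zip(policy_probs, next_q_values))
    return reward + gamma * expected

def q_learning_target(reward, next_q_values, gamma=1.0):
    """R + gamma * max_a Q(S', a)."""
    return reward + gamma * max(next_q_values)

# Hypothetical action values in the next state, and a reward of -1.
next_q = [-3.0, -1.0, -2.0, -4.0]

probs = epsilon_greedy_probs(next_q, epsilon=0.1)
print(expected_sarsa_target(-1.0, next_q, probs))

# With a greedy target policy (epsilon = 0) the two targets coincide.
greedy = epsilon_greedy_probs(next_q, epsilon=0.0)
print(expected_sarsa_target(-1.0, next_q, greedy), q_learning_target(-1.0, next_q))
```

The extra cost relative to Sarsa is the sum over all actions in the next state, in exchange for a target that does not depend on which next action happens to be sampled.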


6.7 Maximization Bias and Double Learning 

All the control algorithms that we have discussed so far involve maximization in the construction of 
their target policies. For example, in Q-learning the target policy is the greedy policy given the current 
action values, which is defined with a max, and in Sarsa the policy is often $\varepsilon$-greedy, which also involves 
a maximization operation. In these algorithms, a maximum over estimated values is used implicitly as 
an estimate of the maximum value, which can lead to a significant positive bias. To see why, consider a 
single state s where there are many actions a whose true values, q(s, a), are all zero but whose estimated 
values, Q(s,a), are uncertain and thus distributed some above and some below zero. The maximum 
of the true values is zero, but the maximum of the estimates is positive, a positive bias. We call this 
maximization bias. 
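The single-state argument above can be checked numerically. The following is a minimal sketch, assuming unit-variance Gaussian rewards and sample-average estimates; the function names, the number of actions, and the sample counts are illustrative choices, not taken from the text.

```python
import random

def max_of_estimates(num_actions, samples_per_action, rng):
    """Return max_a Q(a), where each Q(a) is a sample average.

    Every action has true value q(a) = 0 and rewards are unit-variance
    Gaussian noise, so any positive result is pure estimation error.
    """
    estimates = []
    for _ in range(num_actions):
        rewards = [rng.gauss(0.0, 1.0) for _ in range(samples_per_action)]
        estimates.append(sum(rewards) / samples_per_action)
    return max(estimates)

def average_max(num_actions=10, samples_per_action=5, trials=2000, seed=0):
    """Average the max-of-estimates over many independent trials."""
    rng = random.Random(seed)
    total = sum(max_of_estimates(num_actions, samples_per_action, rng)
                for _ in range(trials))
    return total / trials

if __name__ == "__main__":
    print("max of true values q(a): 0.0")
    print("average max of estimates Q(a):", round(average_max(), 3))
```

With a single action the average stays near zero, because there is nothing to maximize over; with many actions the average of the maximum is clearly positive even though every true value is zero.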

Example 6.7: Maximization Bias Example The small MDP shown inset in Figure 6.7 provides 
a simple example of how maximization bias can harm the performance of TD control algorithms. The 
MDP has two non-terminal states A and B. Episodes always start in A with a choice between two 
actions, left and right. The right action transitions immediately to the terminal state with a reward 
and return of zero. The left action transitions to B, also with a reward of zero, from which there are 
many possible actions all of which cause immediate termination with a reward drawn from a normal 
distribution with mean $-0.1$ and variance 1.0. Thus, the expected return for any trajectory starting 
with left is $-0.1$, and thus taking left in state A is always a mistake. Nevertheless, our control methods 
may favor left because of maximization bias making B appear to have a positive value. Figure 6.7 shows 
that Q-learning with $\varepsilon$-greedy action selection initially learns to strongly favor the left action on this 
example. Even at asymptote, Q-learning takes the left action about 5% more often than is optimal at 
our parameter settings ($\varepsilon = 0.1$, $\alpha = 0.1$, and $\gamma = 1$). 

Figure 6.7: Comparison of Q-learning and Double Q-learning on a simple episodic MDP (shown inset). 
Q-learning initially learns to take the left action much more often than the right action, and always takes it 
significantly more often than the 5% minimum probability enforced by $\varepsilon$-greedy action selection with $\varepsilon = 0.1$. In 
contrast, Double Q-learning is essentially unaffected by maximization bias. These data are averaged over 10,000 
runs. The initial action-value estimates were zero. Any ties in $\varepsilon$-greedy action selection were broken randomly. 

Are there algorithms that avoid maximization bias? To start, consider a bandit case in which we have 
noisy estimates of the value of each of many actions, obtained as sample averages of the rewards received 
on all the plays with each action. As we discussed above, there will be a positive maximization bias if 
we use the maximum of the estimates as an estimate of the maximum of the true values. One way to 
view the problem is that it is due to using the same samples (plays) both to determine the maximizing 
action and to estimate its value. Suppose we divided the plays in two sets and used them to learn two 
independent estimates, call them $Q_1(a)$ and $Q_2(a)$, each an estimate of the true value $q(a)$, for all $a \in \mathcal{A}$. 
We could then use one estimate, say $Q_1$, to determine the maximizing action $A^* = \operatorname{argmax}_a Q_1(a)$, and 
the other, $Q_2$, to provide the estimate of its value, $Q_2(A^*) = Q_2(\operatorname{argmax}_a Q_1(a))$. This estimate will 
then be unbiased in the sense that $\mathbb{E}[Q_2(A^*)] = q(A^*)$. We can also repeat the process with the role of 
the two estimates reversed to yield a second unbiased estimate $Q_1(\operatorname{argmax}_a Q_2(a))$. This is the idea of 
double learning. Note that although we learn two estimates, only one estimate is updated on each play; 
double learning doubles the memory requirements, but does not increase the amount of computation 
per step. 
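The bandit argument above can be illustrated with a small simulation. This is a sketch under stated assumptions: all true values are zero, rewards are unit-variance Gaussian, and the plays for each action are split into two equal halves (so `plays_per_action` should be even). The function names are hypothetical.

```python
import random

def sample_mean(rng, true_value, n):
    """Sample average of n noisy rewards with the given true value."""
    return sum(rng.gauss(true_value, 1.0) for _ in range(n)) / n

def single_and_double(true_values, plays_per_action, rng):
    """Return (single-estimator value, double-estimator value) for one trial."""
    half = plays_per_action // 2
    q1 = [sample_mean(rng, q, half) for q in true_values]
    q2 = [sample_mean(rng, q, half) for q in true_values]
    # Single estimator: the same (pooled) plays both select and evaluate.
    pooled = [(x + y) / 2 for x, y in zip(q1, q2)]
    single = max(pooled)
    # Double estimator: Q1 selects the action, Q2 evaluates it.
    a_star = max(range(len(q1)), key=lambda a: q1[a])
    double = q2[a_star]
    return single, double

def compare(num_actions=10, plays_per_action=10, trials=4000, seed=0):
    """Average both estimators over many trials; the true maximum is 0."""
    rng = random.Random(seed)
    true_values = [0.0] * num_actions
    single_sum, double_sum = 0.0, 0.0
    for _ in range(trials):
        s, d = single_and_double(true_values, plays_per_action, rng)
        single_sum += s
        double_sum += d
    return single_sum / trials, double_sum / trials

if __name__ == "__main__":
    single, double = compare()
    print("single estimator (max of pooled estimates):", round(single, 3))
    print("double estimator Q2(argmax Q1):            ", round(double, 3))
```

The single estimator averages well above zero, while the double estimator averages close to zero, at the price of each half-estimate being noisier than the pooled one.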

The idea of double learning extends naturally to algorithms for full MDPs. For example, the double 
learning algorithm analogous to Q-learning, called Double Q-learning, divides the time steps in two, 
perhaps by flipping a coin on each step. If the coin comes up heads, the update is 

$$Q_1(S_t, A_t) \leftarrow Q_1(S_t, A_t) + \alpha \Big[ R_{t+1} + \gamma\, Q_2\big(S_{t+1}, \operatorname{argmax}_a Q_1(S_{t+1}, a)\big) - Q_1(S_t, A_t) \Big]. \tag{6.10}$$

If the coin comes up tails, then the same update is done with $Q_1$ and $Q_2$ switched, so that $Q_2$ is updated. 
The two approximate value functions are treated completely symmetrically. The behavior policy can 
use both action-value estimates. For example, an $\varepsilon$-greedy policy for Double Q-learning could be based 
on the average (or sum) of the two action-value estimates. A complete algorithm for Double Q-learning 
is given below. This is the algorithm used to produce the results in Figure 6.7. In that example, double 
learning seems to eliminate the harm caused by maximization bias. Of course there are also double 
versions of Sarsa and Expected Sarsa. 


Double Q-learning 

Initialize $Q_1(s,a)$ and $Q_2(s,a)$, for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$, arbitrarily 
Initialize $Q_1(\text{terminal-state}, \cdot) = Q_2(\text{terminal-state}, \cdot) = 0$ 
Repeat (for each episode): 
    Initialize $S$ 
    Repeat (for each step of episode): 
        Choose $A$ from $S$ using policy derived from $Q_1$ and $Q_2$ (e.g., $\varepsilon$-greedy in $Q_1 + Q_2$) 
        Take action $A$, observe $R$, $S'$ 
        With 0.5 probability: 
            $Q_1(S,A) \leftarrow Q_1(S,A) + \alpha \big( R + \gamma\, Q_2(S', \operatorname{argmax}_a Q_1(S',a)) - Q_1(S,A) \big)$ 
        else: 
            $Q_2(S,A) \leftarrow Q_2(S,A) + \alpha \big( R + \gamma\, Q_1(S', \operatorname{argmax}_a Q_2(S',a)) - Q_2(S,A) \big)$ 
        $S \leftarrow S'$ 
    until $S$ is terminal 

*Exercise 6.12 What are the update equations for Double Expected Sarsa with an $\varepsilon$-greedy target 
policy? □ 
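As one possible rendering of the Double Q-learning box, the following Python sketch runs both Q-learning and Double Q-learning on the small MDP of Example 6.7 and reports how often left is chosen in A. The number of actions available in B (10 here), the episode count, and all function names are assumptions made for illustration; the text says only that B has "many possible actions."

```python
import random

A, B, TERMINAL = 0, 1, 2
LEFT, RIGHT = 0, 1

def step(state, action, rng):
    """Dynamics of the small MDP: returns (reward, next_state)."""
    if state == A:
        return (0.0, B) if action == LEFT else (0.0, TERMINAL)
    return rng.gauss(-0.1, 1.0), TERMINAL   # every action in B terminates

def greedy(values, rng):
    """Index of a maximal entry, ties broken randomly."""
    best = max(values)
    return rng.choice([i for i, v in enumerate(values) if v == best])

def run(double, episodes=300, num_b_actions=10, eps=0.1, alpha=0.1,
        gamma=1.0, seed=0):
    """Return the fraction of episodes in which left was chosen in A."""
    rng = random.Random(seed)

    def table():
        return {A: [0.0, 0.0], B: [0.0] * num_b_actions, TERMINAL: [0.0]}

    q1, q2 = table(), table()   # q2 is unused by plain Q-learning
    lefts = 0
    for _ in range(episodes):
        s = A
        while s != TERMINAL:
            if rng.random() < eps:
                a = rng.randrange(len(q1[s]))
            else:
                # epsilon-greedy in Q1 + Q2 (just Q1 for plain Q-learning)
                both = [x + y for x, y in zip(q1[s], q2[s])]
                a = greedy(both if double else q1[s], rng)
            if s == A and a == LEFT:
                lefts += 1
            r, s2 = step(s, a, rng)
            if double:
                # a coin flip decides which table is updated
                upd, other = (q1, q2) if rng.random() < 0.5 else (q2, q1)
                a_star = greedy(upd[s2], rng)
                target = r + gamma * other[s2][a_star]
            else:
                upd = q1
                target = r + gamma * max(q1[s2])
            upd[s][a] += alpha * (target - upd[s][a])
            s = s2
    return lefts / episodes

if __name__ == "__main__":
    seeds = range(20)
    q = sum(run(False, seed=k) for k in seeds) / len(seeds)
    d = sum(run(True, seed=k) for k in seeds) / len(seeds)
    print("fraction of left actions, Q-learning:       ", round(q, 3))
    print("fraction of left actions, Double Q-learning:", round(d, 3))
```

Averaged over a handful of seeds, plain Q-learning should choose left noticeably more often than Double Q-learning, in line with the qualitative picture in Figure 6.7; the exact numbers depend on the assumed settings.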


6.8 Games, Afterstates, and Other Special Cases 

In this book we try to present a uniform approach to a wide class of tasks, but of course there are 
always exceptional tasks that are better treated in a specialized way. For example, our general approach 
involves learning an action-value function, but in Chapter 1 we presented a TD method for learning 
to play tic-tac-toe that learned something much more like a state-value function. If we look closely at 
that example, it becomes apparent that the function learned there is neither an action-value function 
nor a state-value function in the usual sense. A conventional state-value function evaluates states in 
which the agent has the option of selecting an action, but the state-value function used in tic-tac-toe 
evaluates board positions after the agent has made its move. Let us call these afterstates, and value 
functions over these, afterstate value functions. Afterstates are useful when we have knowledge of an 
initial part of the environment’s dynamics but not necessarily of the full dynamics. For example, in 
games we typically know the immediate effects of our moves. We know for each possible chess move 
what the resulting position will be, but not how our opponent will reply. Afterstate value functions are 
a natural way to take advantage of this kind of knowledge and thereby produce a more efficient learning 
method. 

The reason it is more efficient to design algorithms in terms of afterstates is apparent from the 
tic-tac-toe example. A conventional action-value function would map from positions and moves to an 
estimate of the value. But many position-move pairs produce the same resulting position, as in this 
example: 

[Figure: two tic-tac-toe position-move pairs, starting from different positions and using different moves by X, that produce the same resulting position.] 

In such cases the position-move pairs are different but produce the same “afterposition,” and thus must 
have the same value. A conventional action-value function would have to separately assess both pairs, 
whereas an afterstate value function would immediately assess both equally. Any learning about the 
position-move pair on the left would immediately transfer to the pair on the right. 
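The transfer of learning between position-move pairs can be made concrete with a table keyed by the resulting position. This is an illustrative sketch only: the board encoding (a tuple of nine squares, row by row), the helper `apply_move`, and the stored value 0.7 are all hypothetical.

```python
def apply_move(board, square, mark="X"):
    """Return the afterstate: the board after placing `mark` on `square`."""
    assert board[square] == " ", "square must be empty"
    return board[:square] + (mark,) + board[square + 1:]

# Two different positions (X to move in both) ...
position_1 = ("X", " ", " ",
              " ", "O", " ",
              " ", " ", " ")
position_2 = (" ", " ", " ",
              " ", "O", " ",
              " ", " ", "X")

# ... and two different moves that lead to the same afterstate.
after_1 = apply_move(position_1, 8)
after_2 = apply_move(position_2, 0)

# An afterstate value table has one entry per resulting position, so
# whatever is learned through the first pair is shared with the second.
afterstate_value = {}
afterstate_value[after_1] = 0.7   # hypothetical learned value

if __name__ == "__main__":
    print(after_1 == after_2)            # both pairs give the same board
    print(afterstate_value[after_2])     # value learned via pair 1
```

A conventional action-value table keyed by (position, move) would hold two separate entries here and would have to learn each one independently.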

Afterstates arise in many tasks, not just games. For example, in queuing tasks there are actions 
such as assigning customers to servers, rejecting customers, or discarding information. In such cases 
the actions are in fact defined in terms of their immediate effects, which are completely known. 

It is impossible to describe all the possible kinds of specialized problems and corresponding specialized 
learning algorithms. However, the principles developed in this book should apply widely. For example, 
afterstate methods are still aptly described in terms of generalized policy iteration, with a policy and 
(afterstate) value function interacting in essentially the same way. In many cases one will still face the 
choice between on-policy and off-policy methods for managing the need for persistent exploration. 

Exercise 6.13 Describe how the task of Jack’s Car Rental (Example 4.2) could be reformulated in 
terms of afterstates. Why, in terms of this specific task, would such a reformulation be likely to speed 
convergence? □ 


6.9 Summary 

In this chapter we introduced a new kind of learning method, temporal-difference (TD) learning, and 
showed how it can be applied to the reinforcement learning problem. As usual, we divided the overall 
problem into a prediction problem and a control problem. TD methods are alternatives to Monte Carlo 
methods for solving the prediction problem. In both cases, the extension to the control problem is via 
the idea of generalized policy iteration (GPI) that we abstracted from dynamic programming. This is 
the idea that approximate policy and value functions should interact in such a way that they both move 
toward their optimal values. 

One of the two processes making up GPI drives the value function to accurately predict returns for 
the current policy; this is the prediction problem. The other process drives the policy to improve locally 
(e.g., to be $\varepsilon$-greedy) with respect to the current value function. When the first process is based on 
experience, a complication arises concerning maintaining sufficient exploration. We can classify TD 
control methods according to whether they deal with this complication by using an on-policy or 
off-policy approach. Sarsa is an on-policy method, and Q-learning is an off-policy method. Expected Sarsa 
is also an off-policy method as we present it here. There is a third way in which TD methods can 
be extended to control which we did not include in this chapter, called actor-critic methods. These 
methods are covered in full in Chapter 13. 

The methods presented in this chapter are today the most widely used reinforcement learning 
methods. This is probably due to their great simplicity: they can be applied on-line, with a minimal amount 
of computation, to experience generated from interaction with an environment; they can be expressed 
nearly completely by single equations that can be implemented with small computer programs. In the 
next few chapters we extend these algorithms, making them slightly more complicated and significantly 
more powerful. All the new algorithms will retain the essence of those introduced here: they will be 
able to process experience on-line, with relatively little computation, and they will be driven by TD 
errors. The special cases of TD methods introduced in the present chapter should rightly be called 
one-step, tabular, model-free TD methods. In the next two chapters we extend them to multistep forms 
(a link to Monte Carlo methods) and forms that include a model of the environment (a link to planning 
and dynamic programming). Then, in the second part of the book we extend them to various forms of 
function approximation rather than tables (a link to deep learning and artificial neural networks). 

Finally, in this chapter we have discussed TD methods entirely within the context of reinforcement 
learning problems, but TD methods are actually more general than this. They are general methods for 
learning to make long-term predictions about dynamical systems. For example, TD methods may be 
relevant to predicting financial data, life spans, election outcomes, weather patterns, animal behavior, 
demands on power stations, or customer purchases. It was only when TD methods were analyzed 
as pure prediction methods, independent of their use in reinforcement learning, that their theoretical 
properties first came to be well understood. Even so, these other potential applications of TD learning 
methods have not yet been extensively explored. 


Bibliographical and Historical Remarks 

As we outlined in Chapter 1, the idea of TD learning has its early roots in animal learning psychology 
and artificial intelligence, most notably the work of Samuel (1959) and Klopf (1972). Samuel’s work is 
described as a case study in Section 16.2. Also related to TD learning are Holland’s (1975, 1976) early 
ideas about consistency among value predictions. These influenced one of the authors (Barto), who 
was a graduate student from 1970 to 1975 at the University of Michigan, where Holland was teaching. 
Holland’s ideas led to a number of TD-related systems, including the work of Booker (1982) and the 
bucket brigade of Holland (1986), which is related to Sarsa as discussed below. 

6.1–2 Most of the specific material from these sections is from Sutton (1988), including the TD(0) 
algorithm, the random walk example, and the term “temporal-difference learning.” The 
characterization of the relationship to dynamic programming and Monte Carlo methods was influenced 
by Watkins (1989), Werbos (1987), and others. The use of backup diagrams was new to the 
first edition of this book. 

Tabular TD(0) was proved to converge in the mean by Sutton (1988) and with probability 
1 by Dayan (1992), based on the work of Watkins and Dayan (1992). These results were 
extended and strengthened by Jaakkola, Jordan, and Singh (1994) and Tsitsiklis (1994) by 
using extensions of the powerful existing theory of stochastic approximation. Other extensions 
and generalizations are covered in later chapters. 

6.3 The optimality of the TD algorithm under batch training was established by Sutton (1988). 
Illuminating this result is Barnard’s (1993) derivation of the TD algorithm as a combination of 
one step of an incremental method for learning a model of the Markov chain and one step of a 
method for computing predictions from the model. The term certainty equivalence is from the 
adaptive control literature (e.g., Goodwin and Sin, 1984). 

6.4 The Sarsa algorithm was introduced by Rummery and Niranjan (1994). They explored it in 
conjunction with neural networks and called it “Modified Connectionist Q-learning”. The name 
“Sarsa” was introduced by Sutton (1996). The convergence of one-step tabular Sarsa (the form 
treated in this chapter) has been proved by Singh, Jaakkola, Littman, and Szepesvari (2000). 
The “windy gridworld” example was suggested by Tom Kalt. 

Holland’s (1986) bucket brigade idea evolved into an algorithm closely related to Sarsa. The 
original idea of the bucket brigade involved chains of rules triggering each other; it focused 
on passing credit back from the current rule to the rules that triggered it. Over time, the 
bucket brigade came to be more like TD learning in passing credit back to any temporally 
preceding rule, not just to the ones that triggered the current rule. The modern form of the 
bucket brigade, when simplified in various natural ways, is nearly identical to one-step Sarsa, 
as detailed by Wilson (1994). 

6.5 Q-learning was introduced by Watkins (1989), whose outline of a convergence proof was made 
rigorous by Watkins and Dayan (1992). More general convergence results were proved by 
Jaakkola, Jordan, and Singh (1994) and Tsitsiklis (1994). 

6.6 Expected Sarsa was first described in an exercise in the first edition of this book, then fully 
investigated by van Seijen, van Hasselt, Whiteson, and Wiering (2009). They established its 
convergence properties and conditions under which it will outperform regular Sarsa and 
Q-learning. Our Figure 6.6 is adapted from their results. Our presentation differs slightly from 
theirs in that they define “Expected Sarsa” to be an on-policy method exclusively, whereas we 
use this name for the general algorithm in which the target and behavior policies are allowed 
to differ. The general off-policy view of Expected Sarsa was first noted by van Hasselt (2011), 
who called it “General Q-learning”. 

6.7 Maximization bias and double learning were introduced and extensively investigated by Hado 
van Hasselt (2010, 2011). The example MDP in Figure 6.7 was adapted from that in his Figure 
4.1 (van Hasselt, 2011). 

6.8 The notion of an afterstate is the same as that of a “post-decision state” (Van Roy, Bertsekas, 
Lee, and Tsitsiklis, 1997; Powell, 2010). 



Chapter 7 


n-step Bootstrapping 


In this chapter we unify the Monte Carlo (MC) methods and the one-step temporal-difference (TD) 
methods presented in the previous two chapters. Neither MC methods nor one-step TD methods are 
always the best. In this chapter we present n-step TD methods that generalize both methods so that 
one can shift from one to the other smoothly as needed to meet the demands of a particular task. n-step 
methods span a spectrum with MC methods at one end and one-step TD methods at the other. The 
best methods are often intermediate between the two extremes. 

Another way of looking at the benefits of n-step methods is that they free you from the tyranny of 
the time step. With one-step TD methods the same time step determines how often the action can be 
changed and the time interval over which bootstrapping is done. In many applications one wants to be 
able to update the action very fast to take into account anything that has changed, but bootstrapping 
works best if it is over a length of time in which a significant and recognizable state change has occurred. 
With one-step TD methods, these time intervals are the same, and so a compromise must be made. 
n-step methods enable bootstrapping to occur over multiple steps, freeing us from the tyranny of the 
single time step. 

The idea of n-step methods is usually used as an introduction to the algorithmic idea of eligibility 
traces (Chapter 12), which enable bootstrapping over multiple time intervals simultaneously. Here we 
instead consider the n-step bootstrapping idea on its own, postponing the treatment of eligibility-trace 
mechanisms until later. This allows us to separate the issues better, dealing with as many of them as 
possible in the simpler n-step setting. 

As usual, we first consider the prediction problem and then the control problem. That is, we first 
consider how n-step methods can help in predicting returns as a function of state for a fixed policy (i.e., 
in estimating $v_\pi$). Then we extend the ideas to action values and control methods. 


7.1 n-step TD Prediction 

What is the space of methods lying between Monte Carlo and TD methods? Consider estimating $v_\pi$ 
from sample episodes generated using $\pi$. Monte Carlo methods perform an update for each state based 
on the entire sequence of observed rewards from that state until the end of the episode. The update 
of one-step TD methods, on the other hand, is based on just the one next reward, bootstrapping from 
the value of the state one step later as a proxy for the remaining rewards. One kind of intermediate 
method, then, would perform an update based on an intermediate number of rewards: more than one, 
but less than all of them until termination. For example, a two-step update would be based on the first 
two rewards and the estimated value of the state two steps later. Similarly, we could have three-step 
updates, four-step updates, and so on. Figure 7.1 shows the backup diagrams of the spectrum of n-step 
updates for $v_\pi$, with the one-step TD update on the left and the up-until-termination Monte Carlo 
update on the right. 

[Figure 7.1 diagram: backup diagrams, left to right, for 1-step TD (TD(0)), 2-step TD, 3-step TD, n-step TD, and ∞-step TD (Monte Carlo).] 

Figure 7.1: The backup diagrams of n-step methods. These methods form a spectrum ranging from one-step 
TD methods to Monte Carlo methods. 

The methods that use n-step updates are still TD methods because they still change an earlier 
estimate based on how it differs from a later estimate. Now the later estimate is not one step later, 
but n steps later. Methods in which the temporal difference extends over n steps are called n-step TD 
methods. The TD methods introduced in the previous chapter all used one-step updates, which is why 
we called them one-step TD methods. 

More formally, consider the update of the estimated value of state $S_t$ as a result of the state-reward 
sequence, $S_t, R_{t+1}, S_{t+1}, R_{t+2}, \ldots, R_T, S_T$ (omitting the actions). We know that in Monte Carlo updates 
the estimate of $v_\pi(S_t)$ is updated in the direction of the complete return: 

$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T,$$

where $T$ is the last time step of the episode. Let us call this quantity the target of the update. Whereas 
in Monte Carlo updates the target is the return, in one-step updates the target is the first reward plus 
the discounted estimated value of the next state, which we call the one-step return: 

$$G_{t:t+1} \doteq R_{t+1} + \gamma V_t(S_{t+1}),$$

where $V_t : \mathcal{S} \to \mathbb{R}$ here is the estimate at time $t$ of $v_\pi$. The subscripts on $G_{t:t+1}$ indicate that it is a 
truncated return for time $t$ using rewards up until time $t+1$, with the discounted estimate $\gamma V_t(S_{t+1})$ 
taking the place of the other terms $\gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T$ of the full return, as discussed 
in the previous chapter. Our point now is that this idea makes just as much sense after two steps as it 
does after one. The target for a two-step update is the two-step return: 

$$G_{t:t+2} \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 V_{t+1}(S_{t+2}),$$

where now $\gamma^2 V_{t+1}(S_{t+2})$ corrects for the absence of the terms $\gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots + \gamma^{T-t-1} R_T$. 
Similarly, the target for an arbitrary n-step update is the n-step return: 

$$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n}), \tag{7.1}$$

for all $n, t$ such that $n \ge 1$ and $0 \le t < T-n$. All n-step returns can be considered approximations to the 
full return, truncated after $n$ steps and then corrected for the remaining missing terms by $V_{t+n-1}(S_{t+n})$. 
If $t+n \ge T$ (if the n-step return extends to or beyond termination), then all the missing terms are taken 
as zero, and the n-step return is defined to be equal to the ordinary full return ($G_{t:t+n} \doteq G_t$ if $t+n \ge T$). 
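The n-step return (7.1), including its convention at termination, can be written as a short function. This is a sketch under assumed conventions: `rewards[k]` holds $R_k$ (so `rewards[0]` is unused), `states[k]` holds $S_k$, and `V` is a dictionary of current estimates; none of these names come from the text.

```python
def n_step_return(rewards, states, V, t, n, gamma, T):
    """n-step return G_{t:t+n} of (7.1).

    rewards[k] is R_k (index 0 unused), states[k] is S_k, V maps
    non-terminal states to value estimates, T is the terminal time.
    """
    end = min(t + n, T)
    # Discounted sum of the real rewards R_{t+1}, ..., R_{end}.
    G = sum(gamma ** (k - t - 1) * rewards[k] for k in range(t + 1, end + 1))
    # Bootstrap from V only if the return stops short of termination.
    if t + n < T:
        G += gamma ** n * V[states[t + n]]
    return G

if __name__ == "__main__":
    # A three-step episode C -> D -> E -> terminal with a final reward of 1.
    rewards = [None, 0.0, 0.0, 1.0]
    states = ["C", "D", "E", "terminal"]
    V = {"C": 0.5, "D": 0.5, "E": 0.5}
    for n in (1, 2, 3):
        print(n, n_step_return(rewards, states, V, t=0, n=n, gamma=1.0, T=3))
```

When $t + n \ge T$ the bootstrap term is skipped and the function returns the ordinary full return, matching the convention stated above.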

Note that n-step returns for $n > 1$ involve future rewards and states that are not available at the 
time of transition from $t$ to $t+1$. No real algorithm can use the n-step return until after it has seen 
$R_{t+n}$ and computed $V_{t+n-1}$. The first time these are available is $t+n$. The natural state-value learning 
algorithm for using n-step returns is thus 

$$V_{t+n}(S_t) \doteq V_{t+n-1}(S_t) + \alpha \big[ G_{t:t+n} - V_{t+n-1}(S_t) \big], \qquad 0 \le t < T, \tag{7.2}$$

while the values of all other states remain unchanged: $V_{t+n}(s) = V_{t+n-1}(s)$, for all $s \ne S_t$. We call this 
algorithm n-step TD. Note that no changes at all are made during the first $n-1$ steps of each episode. 
To make up for that, an equal number of additional updates are made at the end of the episode, after 
termination and before starting the next episode. 


n-step TD for estimating $V \approx v_\pi$ 

Initialize $V(s)$ arbitrarily, for all $s \in \mathcal{S}$ 
Parameters: step size $\alpha \in (0, 1]$, a positive integer $n$ 
All store and access operations (for $S_t$ and $R_t$) can take their index mod $n+1$ 

Repeat (for each episode): 
    Initialize and store $S_0 \ne$ terminal 
    $T \leftarrow \infty$ 
    For $t = 0, 1, 2, \ldots$ : 
    |   If $t < T$, then: 
    |       Take an action according to $\pi(\cdot \mid S_t)$ 
    |       Observe and store the next reward as $R_{t+1}$ and the next state as $S_{t+1}$ 
    |       If $S_{t+1}$ is terminal, then $T \leftarrow t + 1$ 
    |   $\tau \leftarrow t - n + 1$   ($\tau$ is the time whose state’s estimate is being updated) 
    |   If $\tau \ge 0$: 
    |       $G \leftarrow \sum_{i=\tau+1}^{\min(\tau+n,\,T)} \gamma^{\,i-\tau-1} R_i$ 
    |       If $\tau + n < T$, then: $G \leftarrow G + \gamma^n V(S_{\tau+n})$      ($G_{\tau:\tau+n}$) 
    |       $V(S_\tau) \leftarrow V(S_\tau) + \alpha \big[ G - V(S_\tau) \big]$ 
    Until $\tau = T - 1$ 
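The n-step TD box can be turned into a short program. The sketch below applies it to a random walk with terminal rewards of $-1$ on the left and $+1$ on the right, with the policy fixed to step left or right with equal probability. For readability it stores the whole episode in lists rather than using a circular buffer; the function names, defaults, and state numbering are assumptions.

```python
import random

def random_walk_step(state, move, num_states):
    """States are 1..num_states; 0 and num_states + 1 are terminal."""
    nxt = state + move
    if nxt == 0:
        return -1.0, nxt, True
    if nxt == num_states + 1:
        return 1.0, nxt, True
    return 0.0, nxt, False

def n_step_td(n, alpha, gamma=1.0, episodes=10, num_states=19, seed=0):
    """Estimate state values of the random walk with n-step TD."""
    rng = random.Random(seed)
    V = [0.0] * (num_states + 2)          # terminal entries stay at 0
    for _ in range(episodes):
        S = [(num_states + 1) // 2]       # start in the centre state
        R = [0.0]                         # placeholder so that R[k] is R_k
        T = float("inf")
        t = 0
        while True:
            if t < T:
                move = rng.choice((-1, 1))            # the fixed policy
                reward, nxt, done = random_walk_step(S[t], move, num_states)
                S.append(nxt)
                R.append(reward)
                if done:
                    T = t + 1
            tau = t - n + 1               # time whose estimate is updated
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * R[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * V[S[tau + n]]
                V[S[tau]] += alpha * (G - V[S[tau]])
            if tau == T - 1:
                break
            t += 1
    return V[1:-1]                        # values of non-terminal states

if __name__ == "__main__":
    values = n_step_td(n=4, alpha=0.1, episodes=200)
    print([round(v, 2) for v in values])
```

For this walk the true value of state $i$ is $(i - 10)/10$, so the printed estimates should rise roughly linearly from left to right.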


Exercise 7.1 In Chapter 6 we noted that the Monte Carlo error can be written as the sum of TD 
errors (6.6) if the value estimates don’t change from step to step. Show that the n-step error used in 
(7.2) can also be written as a sum of TD errors (again if the value estimates don’t change), generalizing 
the earlier result. □ 

Exercise 7.2 (programming) With an n-step method, the value estimates do change from step to 
step, so an algorithm that used the sum of TD errors (see previous exercise) in place of the error in 
(7.2) would actually be a slightly different algorithm. Would it be a better algorithm or a worse one? 
Devise and program a small experiment to answer this question empirically. □ 

The n-step return uses the value function $V_{t+n-1}$ to correct for the missing rewards beyond $R_{t+n}$. 
An important property of n-step returns is that their expectation is guaranteed to be a better estimate 
of $v_\pi$ than $V_{t+n-1}$ is, in a worst-state sense. That is, the worst error of the expected n-step return is 
guaranteed to be less than or equal to $\gamma^n$ times the worst error under $V_{t+n-1}$: 

$$\max_s \Big| \mathbb{E}_\pi\big[ G_{t:t+n} \mid S_t = s \big] - v_\pi(s) \Big| \;\le\; \gamma^n \max_s \Big| V_{t+n-1}(s) - v_\pi(s) \Big|, \tag{7.3}$$

for all $n \ge 1$. This is called the error reduction property of n-step returns. Because of the error reduction 
property, one can show formally that all n-step TD methods converge to the correct predictions under 
appropriate technical conditions. The n-step TD methods thus form a family of sound methods, with 
one-step TD methods and Monte Carlo methods as extreme members. 
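A brief justification of (7.3) may help. The sketch below assumes that the estimate is held fixed (write $V$ for $V_{t+n-1}$) and that both $V$ and $v_\pi$ are zero at the terminal state. Unrolling the Bellman equation for $v_\pi$ over $n$ steps produces the same reward terms as the n-step return, so only the bootstrap terms differ:

```latex
\begin{aligned}
\mathbb{E}_\pi[G_{t:t+n} \mid S_t = s] - v_\pi(s)
  &= \mathbb{E}_\pi\Big[\textstyle\sum_{k=1}^{n} \gamma^{k-1} R_{t+k}
       + \gamma^n V(S_{t+n}) \,\Big|\, S_t = s\Big] \\
  &\quad - \mathbb{E}_\pi\Big[\textstyle\sum_{k=1}^{n} \gamma^{k-1} R_{t+k}
       + \gamma^n v_\pi(S_{t+n}) \,\Big|\, S_t = s\Big] \\
  &= \gamma^n\, \mathbb{E}_\pi\big[ V(S_{t+n}) - v_\pi(S_{t+n}) \mid S_t = s \big], \\[4pt]
\Big|\mathbb{E}_\pi[G_{t:t+n} \mid S_t = s] - v_\pi(s)\Big|
  &\le \gamma^n \max_{s'} \big| V(s') - v_\pi(s') \big| .
\end{aligned}
```

Taking the maximum over $s$ on the left-hand side gives (7.3). The last step uses only the fact that an expectation of errors cannot exceed the largest error in absolute value.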

Example 7.1: n-step TD Methods on the Random Walk Consider using n-step TD methods 
on the 5-state random walk task described in Example 6.2. Suppose the first episode progressed directly 
from the center state, C, to the right, through D and E, and then terminated on the right with a return 
of 1. Recall that the estimated values of all the states started at an intermediate value, $V(s) = 0.5$. 
As a result of this experience, a one-step method would change only the estimate for the last state, 
$V(\mathrm{E})$, which would be incremented toward 1, the observed return. A two-step method, on the other 
hand, would increment the values of the two states preceding termination: $V(\mathrm{D})$ and $V(\mathrm{E})$ both would 
be incremented toward 1. A three-step method, or any n-step method for $n > 2$, would increment the 
values of all three of the visited states toward 1, all by the same amount. 
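The claims in this example can be verified directly by replaying the single episode C, D, E, termination with a final reward of 1. The following sketch applies the n-step TD update (7.2) to a recorded episode; the step size $\alpha = 0.1$ and the function name are assumptions made for illustration.

```python
def update_from_episode(states, rewards, n, alpha=0.1, gamma=1.0, init=0.5):
    """Apply the n-step TD updates (7.2) for one recorded episode.

    states  = [S_0, ..., S_{T-1}]  (the terminal state is omitted)
    rewards = [R_1, ..., R_T]
    Returns the value estimates after the episode's updates.
    """
    T = len(rewards)
    V = {s: init for s in states}
    for tau in range(T):                      # updates occur in time order
        end = min(tau + n, T)
        G = sum(gamma ** (i - tau - 1) * rewards[i - 1]
                for i in range(tau + 1, end + 1))
        if tau + n < T:                       # bootstrap if not yet terminated
            G += gamma ** n * V[states[tau + n]]
        V[states[tau]] += alpha * (G - V[states[tau]])
    return V

if __name__ == "__main__":
    episode_states = ["C", "D", "E"]
    episode_rewards = [0.0, 0.0, 1.0]
    for n in (1, 2, 3):
        print(n, update_from_episode(episode_states, episode_rewards, n))
```

With $n = 1$ only the estimate for E moves; with $n = 2$ the estimates for D and E move; with $n = 3$ all three move, each by the same amount $\alpha(1 - 0.5)$.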



Figure 7.2: Performance of n-step TD methods as a function of $\alpha$, for various values of n, on a 19-state random 
walk task (Example 7.1). 


Which value of n is better? Figure 7.2 shows the results of a simple empirical test for a larger random 
walk process, with 19 states instead of 5 (and with a $-1$ outcome on the left, all values initialized to 0), 
which we use as a running example in this chapter. Results are shown for n-step TD methods with a 
range of values for n and $\alpha$. The performance measure for each parameter setting, shown on the vertical 
axis, is the square-root of the average squared error between the predictions at the end of the episode 
for the 19 states and their true values, then averaged over the first 10 episodes and 100 repetitions of the 
whole experiment (the same sets of walks were used for all parameter settings). Note that methods with 
an intermediate value of n worked best. This illustrates how the generalization of TD and Monte Carlo 
methods to n-step methods can potentially perform better than either of the two extreme methods. 

Exercise 7.3 Why do you think a larger random walk task (19 states instead of 5) was used in the 
examples of this chapter? Would a smaller walk have shifted the advantage to a different value of n? 
How about the change in left-side outcome from 0 to $-1$ made in the larger walk? Do you think that 
made any difference in the best value of n? □ 


7.2 n-step Sarsa 

How can n-step methods be used not just for prediction, but for control? In this section we show 
how n-step methods can be combined with Sarsa in a straightforward way to produce an on-policy TD 
control method. The n-step version of Sarsa we call n-step Sarsa, and the original version presented in 
the previous chapter we henceforth call one-step Sarsa, or Sarsa(0). 

The main idea is to simply switch states for actions (state-action pairs) and then use an $\varepsilon$-greedy 
policy. The backup diagrams for n-step Sarsa (shown in Figure 7.3), like those of n-step TD (Figure 7.1), 
are strings of alternating states and actions, except that the Sarsa ones all start and end with an action 
rather than a state. We redefine n-step returns (update targets) in terms of estimated action values: 

$$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n}), \qquad n \ge 1,\ 0 \le t < T-n, \tag{7.4}$$

with $G_{t:t+n} \doteq G_t$ if $t+n \ge T$. The natural algorithm is then 

$$Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha \big[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \big], \qquad 0 \le t < T, \tag{7.5}$$

while the values of all other states remain unchanged: $Q_{t+n}(s,a) = Q_{t+n-1}(s,a)$, for all $s, a$ such that 
$s \ne S_t$ or $a \ne A_t$. This is the algorithm we call n-step Sarsa. Pseudocode is shown in the box on the 
next page, and an example of why it can speed up learning compared to one-step methods is given in 
Figure 7.4. 


[Figure 7.3 diagram: backup diagrams, left to right, for 1-step Sarsa (aka Sarsa(0)), 2-step Sarsa, 3-step Sarsa, n-step Sarsa, ∞-step Sarsa (aka Monte Carlo), and n-step Expected Sarsa.] 

Figure 7.3: The backup diagrams for the spectrum of n-step methods for state-action values. They range from 
the one-step update of Sarsa(0) to the up-until-termination update of the Monte Carlo method. In between are 
the n-step updates, based on n steps of real rewards and the estimated value of the nth next state-action pair, 
all appropriately discounted. On the far right is the backup diagram for n-step Expected Sarsa. 

[Figure 7.4 diagram: three gridworld panels titled “Path taken”, “Action values increased by one-step Sarsa”, and “Action values increased by 10-step Sarsa”.] 

Figure 7.4: Gridworld example of the speedup of policy learning due to the use of n-step methods. The first 
panel shows the path taken by an agent in a single episode, ending at a location of high reward, marked by 
the G. In this example the values were all initially 0, and all rewards were zero except for a positive reward at 
G. The arrows in the other two panels show which action values were strengthened as a result of this path by 
one-step and n-step Sarsa methods. The one-step method strengthens only the last action of the sequence of 
actions that led to the high reward, whereas the n-step method strengthens the last n actions of the sequence, 
so that much more is learned from the one episode. 

What about Expected Sarsa? The backup diagram for the n-step version of Expected Sarsa is shown 
on the far right in Figure 7.3. It consists of a linear string of sample actions and states, just as in n-step 
Sarsa, except that its last element is a branch over all action possibilities weighted, as always, by their 
probability under $\pi$. This algorithm can be described by the same equation as n-step Sarsa (above) 
except with the n-step return redefined as 

$$G_{t:t+n} \doteq R_{t+1} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \sum_a \pi(a \mid S_{t+n})\, Q_{t+n-1}(S_{t+n}, a), \tag{7.6}$$

for all $n$ and $t$ such that $n \ge 1$ and $0 \le t < T-n$. 
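The two action-value targets, (7.4) for n-step Sarsa and (7.6) for n-step Expected Sarsa, differ only in their final bootstrap term, as the following sketch shows. The data layout (`rewards[k]` is $R_k$, `Q` is a dictionary keyed by state-action pairs, `pi(a, s)` returns a probability) and the function names are assumptions for illustration.

```python
def discounted_rewards(rewards, t, n, gamma, T):
    """Sum of gamma^(k-t-1) * R_k for k = t+1, ..., min(t+n, T)."""
    end = min(t + n, T)
    return sum(gamma ** (k - t - 1) * rewards[k] for k in range(t + 1, end + 1))

def n_step_sarsa_return(rewards, states, actions, Q, t, n, gamma, T):
    """Target (7.4): bootstrap from the sampled pair (S_{t+n}, A_{t+n})."""
    G = discounted_rewards(rewards, t, n, gamma, T)
    if t + n < T:
        G += gamma ** n * Q[(states[t + n], actions[t + n])]
    return G

def n_step_expected_sarsa_return(rewards, states, Q, pi, action_set,
                                 t, n, gamma, T):
    """Target (7.6): bootstrap from the expectation over actions under pi."""
    G = discounted_rewards(rewards, t, n, gamma, T)
    if t + n < T:
        s = states[t + n]
        G += gamma ** n * sum(pi(a, s) * Q[(s, a)] for a in action_set)
    return G

if __name__ == "__main__":
    rewards = [None, 1.0, 2.0, 3.0]          # R_1, R_2, R_3; episode ends at T = 3
    states = ["s0", "s1", "s2"]
    actions = ["a", "b", "a"]
    Q = {("s0", "a"): 0.0, ("s0", "b"): 0.0,
         ("s1", "a"): 1.0, ("s1", "b"): 3.0,
         ("s2", "a"): 0.0, ("s2", "b"): 0.0}
    uniform = lambda a, s: 0.5               # hypothetical target policy
    print(n_step_sarsa_return(rewards, states, actions, Q, 0, 1, 0.5, 3))
    print(n_step_expected_sarsa_return(rewards, states, Q, uniform,
                                       ["a", "b"], 0, 1, 0.5, 3))
```

In this toy episode the Sarsa target uses the value of the action actually taken in `s1`, whereas the Expected Sarsa target averages over both actions, which removes the variance due to the final action selection.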


7.3 n-step Off-policy Learning by Importance Sampling 


Recall that off-policy learning is learning the value function for one policy, $\pi$, while following another
policy, $b$. Often, $\pi$ is the greedy policy for the current action-value-function estimate, and $b$ is a more
exploratory policy, perhaps $\varepsilon$-greedy. In order to use the data from $b$ we must take into account the
difference between the two policies, using their relative probability of taking the actions that were taken
(see Section 5.5). In n-step methods, returns are constructed over n steps, so we are interested in the
relative probability of just those n actions. For example, to make a simple off-policy version of n-step
TD, the update for time t (actually made at time t + n) can simply be weighted by $\rho_{t:t+n-1}$:

$$V_{t+n}(S_t) \doteq V_{t+n-1}(S_t) + \alpha \rho_{t:t+n-1} \left[ G_{t:t+n} - V_{t+n-1}(S_t) \right], \qquad 0 \le t < T, \tag{7.7}$$


where $\rho_{t:t+n-1}$, called the importance sampling ratio, is the relative probability under the two policies
of taking the n actions from $A_t$ to $A_{t+n-1}$ (cf. Eq. 5.3):

$$\rho_{t:h} \doteq \prod_{k=t}^{\min(h,\,T-1)} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}. \tag{7.8}$$
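
The ratio (7.8) is just a running product over the actions actually taken. A minimal sketch, assuming the per-step probabilities have already been stored in two lists (the names `pi_probs` and `b_probs` are hypothetical):

```python
def importance_ratio(pi_probs, b_probs, t, h, T):
    """rho_{t:h} in the form of Eq. (7.8).

    pi_probs[k] and b_probs[k] hold pi(A_k|S_k) and b(A_k|S_k) for the
    actions actually taken; every b_probs[k] is assumed to be positive.
    """
    rho = 1.0  # an empty product is 1
    for k in range(t, min(h, T - 1) + 1):
        rho *= pi_probs[k] / b_probs[k]
    return rho
```

If the two lists are identical (the on-policy case), every factor is 1 and so is the ratio; a single zero in `pi_probs` within the window makes the whole ratio zero.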


For example, if any one of the actions would never be taken by $\pi$ (i.e., $\pi(A_k|S_k) = 0$) then the n-step
return should be given zero weight and be totally ignored. On the other hand, if by chance an action is
taken that $\pi$ would take with much greater probability than $b$ does, then this will increase the weight
that would otherwise be given to the return. This makes sense because that action is characteristic of $\pi$
(and therefore we want to learn about it) but is selected only rarely by $b$ and thus rarely appears in the
data. To make up for this we have to over-weight it when it does occur. Note that if the two policies
are actually the same (the on-policy case) then the importance sampling ratio is always 1. Thus our
new update (7.7) generalizes and can completely replace our earlier n-step TD update. Similarly, our
previous n-step Sarsa update can be completely replaced by a simple off-policy form:


$$Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha \rho_{t+1:t+n-1} \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right], \tag{7.9}$$

for $0 \le t < T$. Note that the importance sampling ratio here starts one step later than for n-step TD
(above). This is because here we are updating a state-action pair. We do not have to care how likely 
we were to select the action; now that we have selected it we want to learn fully from what happens, 
with importance sampling only for subsequent actions. Pseudocode for the full algorithm is shown in 
the box. 

The off-policy version of n-step Expected Sarsa would use the same update as above for n-step Sarsa
except that the importance sampling ratio would have one less factor in it. That is, the above equation
would use $\rho_{t+1:t+n-2}$ instead of $\rho_{t+1:t+n-1}$, and of course it would use the Expected Sarsa version of
the n-step return (7.6). This is because in Expected Sarsa all possible actions are taken into account
in the last state; the one actually taken has no effect and does not have to be corrected for.


Off-policy n-step Sarsa for estimating $Q \approx q_*$, or $Q \approx q_\pi$ for a given $\pi$

Input: an arbitrary behavior policy $b$ such that $b(a|s) > 0$, for all $s \in \mathcal{S}, a \in \mathcal{A}$
Initialize $Q(s,a)$ arbitrarily, for all $s \in \mathcal{S}, a \in \mathcal{A}$
Initialize $\pi$ to be $\varepsilon$-greedy with respect to $Q$, or as a fixed given policy
Parameters: step size $\alpha \in (0,1]$, small $\varepsilon > 0$, a positive integer $n$
All store and access operations (for $S_t$, $A_t$, and $R_t$) can take their index mod $n$

Repeat (for each episode):
    Initialize and store $S_0 \ne$ terminal
    Select and store an action $A_0 \sim b(\cdot|S_0)$
    $T \leftarrow \infty$
    For $t = 0, 1, 2, \ldots$:
        If $t < T$, then:
            Take action $A_t$
            Observe and store the next reward as $R_{t+1}$ and the next state as $S_{t+1}$
            If $S_{t+1}$ is terminal, then:
                $T \leftarrow t + 1$
            else:
                Select and store an action $A_{t+1} \sim b(\cdot|S_{t+1})$
        $\tau \leftarrow t - n + 1$ ($\tau$ is the time whose estimate is being updated)
        If $\tau \ge 0$:
            $\rho \leftarrow \prod_{i=\tau+1}^{\min(\tau+n-1,\,T-1)} \frac{\pi(A_i|S_i)}{b(A_i|S_i)}$    ($\rho_{\tau+1:\tau+n-1}$)
            $G \leftarrow \sum_{i=\tau+1}^{\min(\tau+n,\,T)} \gamma^{i-\tau-1} R_i$
            If $\tau + n < T$, then: $G \leftarrow G + \gamma^n Q(S_{\tau+n}, A_{\tau+n})$    ($G_{\tau:\tau+n}$)
            $Q(S_\tau, A_\tau) \leftarrow Q(S_\tau, A_\tau) + \alpha \rho \left[ G - Q(S_\tau, A_\tau) \right]$
            If $\pi$ is being learned, then ensure that $\pi(\cdot|S_\tau)$ is $\varepsilon$-greedy wrt $Q$
    Until $\tau = T - 1$
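
The update step of the box above (everything under "If $\tau \ge 0$") might look as follows in Python. This is a sketch rather than a full agent: it assumes the whole trajectory is kept in plain lists instead of being stored mod n, and `pi` and `b` are hypothetical callables returning action probabilities.

```python
def off_policy_sarsa_update(Q, S, A, R, tau, n, T, pi, b, alpha, gamma):
    """Apply the boxed update for time tau (requires tau >= 0).

    Q     -- dict mapping (state, action) to a value estimate
    S, A  -- lists with S[i], A[i] stored for time i
    R     -- list with R[i] the reward received on entering S[i] (R[0] unused)
    pi, b -- hypothetical callables: pi(a, s) and b(a, s) give probabilities
    T     -- termination time, or float("inf") if not yet known
    """
    # rho_{tau+1 : tau+n-1}, the ratio used in Eq. (7.9)
    rho = 1.0
    for i in range(tau + 1, min(tau + n - 1, T - 1) + 1):
        rho *= pi(A[i], S[i]) / b(A[i], S[i])
    # Discounted rewards R_{tau+1} .. R_{min(tau+n, T)}
    G = sum(gamma ** (i - tau - 1) * R[i]
            for i in range(tau + 1, min(tau + n, T) + 1))
    # Bootstrap from Q only if the n-step horizon ends before termination
    if tau + n < T:
        G += gamma ** n * Q[(S[tau + n], A[tau + n])]
    key = (S[tau], A[tau])
    Q[key] += alpha * rho * (G - Q[key])
    return Q
```

Keeping the full trajectory costs more memory than the mod-n storage the box allows, but it keeps the index arithmetic identical to the pseudocode.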


7.4 *Per-reward Off-policy Methods 

The multi-step off-policy methods presented in the previous section are very simple and conceptually 
clear, but are probably not the most efficient. A more sophisticated approach would use per-reward 
importance sampling ideas such as were introduced in Section 5.9. To understand this approach, first 
note that the ordinary n-step return (7.1), like all returns, can be written recursively: 

$$G_{t:h} = R_{t+1} + \gamma G_{t+1:h}.$$

Now consider the effect of following a behavior policy $b \ne \pi$ that is not the same as the target policy
$\pi$. All of the resulting experience, including the first reward $R_{t+1}$ and the next state $S_{t+1}$, must be
weighted by the importance sampling ratio for time t, $\rho_t = \frac{\pi(A_t|S_t)}{b(A_t|S_t)}$. One might be tempted to simply
weight the righthand side of the above equation, but one can do better. Suppose the action at time t
would never be selected by $\pi$, so that $\rho_t$ is zero. Then a simple weighting would result in the n-step
return being zero, which could result in high variance when it was used as a target. Instead, in this
more sophisticated approach, one uses an alternate, off-policy definition of the n-step return, as

$$G_{t:h} \doteq \rho_t \left( R_{t+1} + \gamma G_{t+1:h} \right) + (1 - \rho_t) V_{h-1}(S_t), \qquad t < h \le T, \tag{7.10}$$

where $G_{t:t} \doteq V_{t-1}(S_t)$. Now, if $\rho_t$ is zero, instead of the target being zero and causing the estimate to
shrink, the target is the same as the estimate and causes no change. The importance sampling ratio
being zero means we should ignore the sample, so leaving the estimate unchanged seems an appropriate
outcome. Notice that the second, additional term does not change the expected update; the importance
sampling ratio has expected value one (Section 5.9) and is uncorrelated with the estimate, so the
expected value of the second term is zero. Also note that the off-policy definition (7.10) is a strict
generalization of the earlier on-policy definition of the n-step return (7.1), as the two are identical in
the on-policy case, in which $\rho_t$ is always 1.

For a conventional n-step method, the learning rule to use in conjunction with (7.10) is the n-step 
TD update (7.2), which has no explicit importance sampling ratios other than those embedded in G. 
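
To make the recursion in (7.10) concrete, here is a small sketch of the return alone (not the full prediction algorithm). It assumes the value estimates are held fixed during the computation and that the trajectory and the single-step ratios are stored in lists; all names are illustrative.

```python
def cv_return(t, h, rho, R, V, S, gamma):
    """Off-policy n-step return in the form of Eq. (7.10), with V held fixed.

    rho[k] -- single-step ratio pi(A_k|S_k) / b(A_k|S_k)
    R[k]   -- reward received on entering S[k] (R[0] is unused)
    V      -- dict of state-value estimates; terminal states map to 0
    """
    if t == h:
        return V[S[t]]  # the recursion ends by bootstrapping from V
    inner = R[t + 1] + gamma * cv_return(t + 1, h, rho, R, V, S, gamma)
    # The second term keeps the target at the current estimate when rho[t] = 0.
    return rho[t] * inner + (1.0 - rho[t]) * V[S[t]]
```

With `rho[t] = 0` the function returns `V[S[t]]`, so an update toward this target leaves the estimate unchanged; with every ratio equal to 1 it reduces to the ordinary n-step return.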

Exercise 7.4 Write the pseudocode for the off-policy state-value prediction algorithm described above. □

For action values, the off-policy definition of the n-step return is a little different because the first 
action does not play a role in the importance sampling. We are learning the value of that action and it 
does not matter if it was unlikely or even impossible under the target policy. It has been taken and now 
full unit weight must be given to the reward and state that follows it. Importance sampling will apply 
only to the actions that follow it. The off-policy recursive definition of the n-step return for action
values is

$$G_{t:h} \doteq R_{t+1} + \gamma \left( \rho_{t+1} G_{t+1:h} + (1 - \rho_{t+1}) \bar{Q}_{t+1} \right), \qquad t < h \le T, \tag{7.11}$$

with $\bar{Q}_t \doteq \sum_a \pi(a|S_t) Q_{t-1}(S_t, a)$. A complete n-step off-policy action-value prediction algorithm would
combine (7.11) and (7.5). If the recursion ends with $G_{t:t} \doteq Q(S_t, A_t)$, then the resultant algorithm is
analogous to Sarsa, and if $G_{t:t} \doteq \bar{Q}_t$, then the resultant algorithm is analogous to Expected Sarsa.
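
The recursion (7.11) and its two possible endings can be sketched in the same way. The function below computes only the return, with Q held fixed; it assumes the horizon h falls before termination, and the nested-list layout of `Q` and `pi` is an assumption of the example.

```python
def q_cv_return(t, h, rho, R, S, A, Q, pi, gamma, expected=False):
    """Off-policy n-step return for action values, Eq. (7.11), Q held fixed.

    Q[s][a] -- action-value estimates; pi[s][a] -- target-policy probabilities
    rho[k]  -- single-step ratio for the action taken at time k
    Assumes h < T, so every S[k] used here is non-terminal.
    """
    def q_bar(s):
        # Expected action value of state s under the target policy.
        return sum(p * q for p, q in zip(pi[s], Q[s]))

    if t == h:
        # Expected Sarsa ending if expected=True, otherwise Sarsa ending.
        return q_bar(S[t]) if expected else Q[S[t]][A[t]]
    g_next = q_cv_return(t + 1, h, rho, R, S, A, Q, pi, gamma, expected)
    return R[t + 1] + gamma * (rho[t + 1] * g_next
                               + (1.0 - rho[t + 1]) * q_bar(S[t + 1]))
```

Note that `rho[t]` itself never appears: the first action gets full unit weight, and importance sampling applies only from time t + 1 onward.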

Exercise 7.5 Write the pseudocode for the off-policy action-value prediction algorithm described 
immediately above. Specify both Sarsa and Expected Sarsa variations. □ 

Exercise 7.6 Show that the general (off-policy) version of the n-step return (7.10) can still be written 
exactly and compactly as the sum of state-based TD errors (6.5) if the approximate state value function 
does not change. □ 

Exercise 7.7 Repeat the above exercise for the action version of the off-policy n-step return (7.11) 
and the Expected Sarsa TD error (the quantity in brackets in Equation 6.9). □ 

Exercise 7.8 (programming) Devise a small off-policy prediction problem and use it to show that 
the off-policy learning algorithm using (7.10) and (7.2) is more data efficient than the simpler algorithm 
using (7.1) and (7.7). □ 

The importance sampling that we have used in this section, the previous section, and in Chapter 5 
enables off-policy learning, but at the cost of increasing the variance of the updates. The high variance 
forces us to use a small step-size parameter, resulting in slow learning. It is probably inevitable that 
off-policy training is slower than on-policy training—after all, the data is less relevant to what you 
are trying to learn. However, it is probably also true that the methods we have presented here can 
be improved on. One possibility is to rapidly adapt the step sizes to the observed variance, as in the 
Autostep method (Mahmood et al., 2012). Another promising approach is the invariant updates of
Karampatziakis and Langford (2010) as extended to TD by Tian (in preparation). The usage technique
of Mahmood (2017; Mahmood and Sutton, 2015) is probably also part of the solution. In the next
section we consider an off-policy learning method that does not use importance sampling.


7.5 Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm

Is off-policy learning possible without importance sampling? Q-learning and Expected Sarsa from 
Chapter 6 do this for the one-step case, but is there a corresponding multi-step algorithm? In this 
section we present just such an n-step method, called the tree-backup algorithm. 

The idea of the algorithm is suggested by the 3-step tree-backup backup diagram
shown to the right. Down the central spine and labeled in the diagram are three sample
states and rewards, and two sample actions. These are the random variables representing
the events occurring after the initial state-action pair $S_t, A_t$. Hanging off to the sides
of each state are the actions that were not selected. (For the last state, all the actions
are considered to have not (yet) been selected.) Because we have no sample data for
the unselected actions, we bootstrap and use the estimates of their values in forming
the target for the update. This slightly extends the idea of a backup diagram. So
far we have always updated the estimated value of the node at the top of the diagram
toward a target combining the rewards along the way (appropriately discounted) and
the estimated values of the nodes at the bottom. In the tree-backup update, the target
includes all these things plus the estimated values of the dangling action nodes hanging
off the sides, at all levels. This is why it is called a tree-backup update; it is an update
from the entire tree of estimated action values.

More precisely, the update is from the estimated action values of the leaf nodes 
of the tree. The action nodes in the interior, corresponding to the actual actions 
taken, do not participate. Each leaf node contributes to the target with a weight 
proportional to its probability of occurring under the target policy $\pi$. Thus each
first-level action $a$ contributes with a weight of $\pi(a|S_{t+1})$, except that the action
actually taken, $A_{t+1}$, does not contribute at all. Its probability, $\pi(A_{t+1}|S_{t+1})$, is
used to weight all the second-level action values. Thus, each non-selected second-level action $a'$
contributes with weight $\pi(A_{t+1}|S_{t+1})\pi(a'|S_{t+2})$. Each third-level action contributes with weight
$\pi(A_{t+1}|S_{t+1})\pi(A_{t+2}|S_{t+2})\pi(a''|S_{t+3})$, and so on. It is as if each arrow to an action node in the
diagram is weighted by the action's probability of being selected under the target policy and, if there is
a tree below the action, then that weight applies to all the leaf nodes in the tree.

We can think of the 3-step tree-backup update as consisting of 6 half-steps, alternating between 
sample half-steps from an action to a subsequent state, and expected half-steps considering from that 
state all possible actions with their probabilities of occurring under the policy.

The one-step return (target) of the tree-backup algorithm is the same as that of Expected Sarsa. It 
can be written 

$$\begin{aligned}
G_{t:t+1} &\doteq R_{t+1} + \gamma \sum_a \pi(a|S_{t+1}) Q_t(S_{t+1}, a) \\
&= \delta'_t + Q_{t-1}(S_t, A_t),
\end{aligned}$$

where $\delta'_t$ is a modified form of the TD error from Expected Sarsa:

$$\delta'_t \doteq R_{t+1} + \gamma \sum_a \pi(a|S_{t+1}) Q_t(S_{t+1}, a) - Q_{t-1}(S_t, A_t). \tag{7.12}$$

[Figure: backup diagram of the 3-step tree-backup update, starting from $S_t, A_t$ and $R_{t+1}$]

With these, the general n-step returns of the tree-backup algorithm can be defined recursively, and then
as a sum of TD errors:

$$G_{t:t+n} \doteq R_{t+1} + \gamma \sum_{a \ne A_{t+1}} \pi(a|S_{t+1}) Q_t(S_{t+1}, a) + \gamma \pi(A_{t+1}|S_{t+1}) G_{t+1:t+n} \tag{7.13}$$

$$\begin{aligned}
&= \delta'_t + Q_{t-1}(S_t, A_t) - \gamma \pi(A_{t+1}|S_{t+1}) Q_t(S_{t+1}, A_{t+1}) + \gamma \pi(A_{t+1}|S_{t+1}) G_{t+1:t+n} \\
&= Q_{t-1}(S_t, A_t) + \delta'_t + \gamma \pi(A_{t+1}|S_{t+1}) \left( G_{t+1:t+n} - Q_t(S_{t+1}, A_{t+1}) \right) \\
&= Q_{t-1}(S_t, A_t) + \delta'_t + \gamma \pi(A_{t+1}|S_{t+1}) \delta'_{t+1} + \gamma^2 \pi(A_{t+1}|S_{t+1}) \pi(A_{t+2}|S_{t+2}) \delta'_{t+2} + \cdots \\
&= Q_{t-1}(S_t, A_t) + \sum_{k=t}^{\min(t+n-1,\,T-1)} \delta'_k \prod_{i=t+1}^{k} \gamma \pi(A_i|S_i),
\end{aligned}$$


under the usual convention that a degenerate product with no factors is 1. This target is then used 
with the usual action-value update rule from n-step Sarsa: 

$$Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right], \tag{7.5}$$

while the values of all other state-action pairs remain unchanged: $Q_{t+n}(s, a) = Q_{t+n-1}(s, a)$, for all
$s, a$ such that $s \ne S_t$ or $a \ne A_t$. Pseudocode for this algorithm is shown in the box.
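
Before the full algorithm, it may help to see the recursive definition (7.13) on its own. The sketch below computes the tree-backup return $G_{t:h}$ with Q held fixed; the nested-list layout of `Q` and `pi`, and the convention that `R[k]` is the reward on entering `S[k]`, are assumptions of this example rather than part of the algorithm.

```python
def tree_backup_return(t, h, T, R, S, A, Q, pi, gamma):
    """n-step tree-backup return G_{t:h} via the recursion (7.13), Q held fixed.

    Q[s][a], pi[s][a] -- action values and target-policy probabilities
    R[k] -- reward on entering S[k]; T -- termination time (S[T] is terminal)
    Requires t < h and t < T.
    """
    if t + 1 == T:
        # The episode terminates at the next step: nothing to bootstrap from.
        return R[t + 1]
    s1 = S[t + 1]
    actions = range(len(Q[s1]))
    if t + 1 == h:
        # Last level: all actions are leaves, as in Expected Sarsa.
        return R[t + 1] + gamma * sum(pi[s1][a] * Q[s1][a] for a in actions)
    a1 = A[t + 1]
    # Leaves: the actions not taken, weighted by their target-policy probability.
    leaves = sum(pi[s1][a] * Q[s1][a] for a in actions if a != a1)
    # The taken action passes its probability down to the subtree below it.
    deeper = tree_backup_return(t + 1, h, T, R, S, A, Q, pi, gamma)
    return R[t + 1] + gamma * leaves + gamma * pi[s1][a1] * deeper
```

No behavior-policy probabilities appear anywhere in the computation, which is the sense in which the method needs no importance sampling.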


n-step Tree Backup for estimating $Q \approx q_*$, or $Q \approx q_\pi$ for a given $\pi$

Initialize $Q(s,a)$ arbitrarily, for all $s \in \mathcal{S}, a \in \mathcal{A}$
Initialize $\pi$ to be $\varepsilon$-greedy with respect to $Q$, or as a fixed given policy
Parameters: step size $\alpha \in (0,1]$, small $\varepsilon > 0$, a positive integer $n$
All store and access operations can take their index mod $n$

Repeat (for each episode):
    Initialize and store $S_0 \ne$ terminal
    Select and store an action $A_0 \sim \pi(\cdot|S_0)$
    Store $Q(S_0, A_0)$ as $Q_0$
    $T \leftarrow \infty$
    For $t = 0, 1, 2, \ldots$:
        If $t < T$:
            Take action $A_t$
            Observe the next reward $R$; observe and store the next state as $S_{t+1}$
            If $S_{t+1}$ is terminal:
                $T \leftarrow t + 1$
                Store $R - Q_t$ as $\delta_t$
            else:
                Store $R + \gamma \sum_a \pi(a|S_{t+1}) Q(S_{t+1}, a) - Q_t$ as $\delta_t$
                Select arbitrarily and store an action as $A_{t+1}$
                Store $Q(S_{t+1}, A_{t+1})$ as $Q_{t+1}$
                Store $\pi(A_{t+1}|S_{t+1})$ as $\pi_{t+1}$
        $\tau \leftarrow t - n + 1$ ($\tau$ is the time whose estimate is being updated)
        If $\tau \ge 0$:
            $Z \leftarrow 1$
            $G \leftarrow Q_\tau$
            For $k = \tau, \ldots, \min(\tau + n - 1, T - 1)$:
                $G \leftarrow G + Z \delta_k$
                $Z \leftarrow \gamma Z \pi_{k+1}$
            $Q(S_\tau, A_\tau) \leftarrow Q(S_\tau, A_\tau) + \alpha \left[ G - Q(S_\tau, A_\tau) \right]$
            If $\pi$ is being learned, then ensure that $\pi(a|S_\tau)$ is $\varepsilon$-greedy wrt $Q(S_\tau, \cdot)$
    Until $\tau = T - 1$


7.6 *A Unifying Algorithm: n-step $Q(\sigma)$
So far in this chapter we have considered three different kinds of action-value algorithms, corresponding 
to the first three backup diagrams shown in Figure 7.5. n-step Sarsa has all sample transitions, the 
tree-backup algorithm has all state-to-action transitions fully branched without sampling, and n-step 
Expected Sarsa has all sample transitions except for the last state-to-action one, which is fully branched 
with an expected value. To what extent can these algorithms be unified? 

One idea for unification is suggested by the fourth backup diagram in Figure 7.5. This is the idea 
that one might decide on a step-by-step basis whether one wanted to take the action as a sample, as in 
Sarsa, or consider the expectation over all actions instead, as in the tree-backup update. Then, if one 
chose always to sample, one would obtain Sarsa, whereas if one chose never to sample, one would get 
the tree-backup algorithm. Expected Sarsa would be the case where one chose to sample for all steps 
except for the last one. And of course there would be many other possibilities, as suggested by the last 
diagram in the figure. To increase the possibilities even further we can consider a continuous variation 
between sampling and expectation. Let $\sigma_t \in [0,1]$ denote the degree of sampling on step t, with $\sigma = 1$
denoting full sampling and $\sigma = 0$ denoting a pure expectation with no sampling. The random variable
$\sigma_t$ might be set as a function of the state, action, or state-action pair at time t. We call this proposed
new algorithm n-step $Q(\sigma)$.


[Figure 7.5 panels, left to right: 4-step Sarsa, 4-step Tree backup, 4-step Expected Sarsa, 4-step $Q(\sigma)$]

Figure 7.5: The backup diagrams of the three kinds of n-step action-value updates considered so far
in this chapter (4-step case) plus the backup diagram of a fourth kind of update that unifies them all.
The '$\rho$'s indicate half transitions on which importance sampling is required in the off-policy case. The
fourth kind of update unifies all the others by choosing on a state-by-state basis whether to sample
($\sigma_t = 1$) or not ($\sigma_t = 0$).

Now let us develop the equations of n-step $Q(\sigma)$. First note that the n-step return of Sarsa (7.4) can
be written in terms of its own pure-sample-based TD error:

$$G_{t:t+n} = Q_{t-1}(S_t, A_t) + \sum_{k=t}^{\min(t+n-1,\,T-1)} \gamma^{k-t} \left[ R_{k+1} + \gamma Q_k(S_{k+1}, A_{k+1}) - Q_{k-1}(S_k, A_k) \right]$$

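This suggests that we may be able to cover both cases if we generalize the TD error to slide with $\sigma_t$
from its expectation to its sampling form:

$$\delta_t \doteq R_{t+1} + \gamma \left[ \sigma_{t+1} Q_t(S_{t+1}, A_{t+1}) + (1 - \sigma_{t+1}) \bar{Q}_{t+1} \right] - Q_{t-1}(S_t, A_t), \tag{7.14}$$

with

$$\bar{Q}_t \doteq \sum_a \pi(a|S_t) Q_{t-1}(S_t, a), \tag{7.15}$$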

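as usual. Using these we can define the n-step returns of $Q(\sigma)$ as:

$$\begin{aligned}
G_{t:t+1} &\doteq R_{t+1} + \gamma \left[ \sigma_{t+1} Q_t(S_{t+1}, A_{t+1}) + (1 - \sigma_{t+1}) \bar{Q}_{t+1} \right] \\
&= \delta_t + Q_{t-1}(S_t, A_t), \\
G_{t:t+2} &\doteq R_{t+1} + \gamma \left[ \sigma_{t+1} Q_t(S_{t+1}, A_{t+1}) + (1 - \sigma_{t+1}) \bar{Q}_{t+1} \right] \\
&\quad - \gamma (1 - \sigma_{t+1}) \pi(A_{t+1}|S_{t+1}) Q_t(S_{t+1}, A_{t+1}) \\
&\quad + \gamma (1 - \sigma_{t+1}) \pi(A_{t+1}|S_{t+1}) \left[ R_{t+2} + \gamma \left[ \sigma_{t+2} Q_t(S_{t+2}, A_{t+2}) + (1 - \sigma_{t+2}) \bar{Q}_{t+2} \right] \right] \\
&\quad - \gamma \sigma_{t+1} Q_t(S_{t+1}, A_{t+1}) \\
&\quad + \gamma \sigma_{t+1} \left[ R_{t+2} + \gamma \left[ \sigma_{t+2} Q_t(S_{t+2}, A_{t+2}) + (1 - \sigma_{t+2}) \bar{Q}_{t+2} \right] \right] \\
&= Q_{t-1}(S_t, A_t) + \delta_t \\
&\quad + \gamma (1 - \sigma_{t+1}) \pi(A_{t+1}|S_{t+1}) \delta_{t+1} \\
&\quad + \gamma \sigma_{t+1} \delta_{t+1} \\
&= Q_{t-1}(S_t, A_t) + \delta_t + \gamma \left[ (1 - \sigma_{t+1}) \pi(A_{t+1}|S_{t+1}) + \sigma_{t+1} \right] \delta_{t+1}
\end{aligned}$$

$$G_{t:t+n} \doteq Q_{t-1}(S_t, A_t) + \sum_{k=t}^{\min(t+n-1,\,T-1)} \delta_k \prod_{i=t+1}^{k} \gamma \left[ (1 - \sigma_i) \pi(A_i|S_i) + \sigma_i \right]. \tag{7.16}$$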
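
The sum in (7.16) can be accumulated with one running product, in the same way the algorithm box later does. The sketch below holds Q and $\pi$ fixed and assumes the n-step window ends before termination; the data layout and all names are assumptions made for illustration.

```python
def q_sigma_return(t, n, R, S, A, Q, pi, sigma, gamma):
    """n-step Q(sigma) return as the sum of TD errors in Eq. (7.16).

    Q[s][a], pi[s][a] -- action values and target-policy probabilities (fixed)
    sigma[k] in [0, 1] -- degree of sampling at step k
    Assumes t + n < T, i.e. no termination inside the window.
    """
    def q_bar(s):
        # Expected action value under the target policy, Eq. (7.15).
        return sum(p * q for p, q in zip(pi[s], Q[s]))

    G = Q[S[t]][A[t]]
    Z = 1.0  # running product of gamma * [(1 - sigma_i) pi(A_i|S_i) + sigma_i]
    for k in range(t, t + n):
        s1, a1 = S[k + 1], A[k + 1]
        # Generalized TD error (7.14): slides between expectation and sampling.
        delta = (R[k + 1]
                 + gamma * (sigma[k + 1] * Q[s1][a1]
                            + (1.0 - sigma[k + 1]) * q_bar(s1))
                 - Q[S[k]][A[k]])
        G += Z * delta
        Z *= gamma * ((1.0 - sigma[k + 1]) * pi[s1][a1] + sigma[k + 1])
    return G
```

Setting every `sigma[k]` to 1 gives the n-step Sarsa return, and setting every `sigma[k]` to 0 gives the tree-backup return, which is a quick way to check an implementation.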


Under on-policy training, this return is ready to be used in an update such as that for n-step Sarsa 
(7.5). For the off-policy case, we need to take $\sigma$ into account in the importance sampling ratio, which
we redefine more generally as

$$\rho_{t:h} \doteq \prod_{k=t}^{\min(h,\,T-1)} \left( \sigma_k \frac{\pi(A_k|S_k)}{b(A_k|S_k)} + 1 - \sigma_k \right). \tag{7.17}$$


After this we can then use the usual general (off-policy) update for n-step Sarsa (7.9). A complete
algorithm is given in the box below.


Off-policy n-step $Q(\sigma)$ for estimating $Q \approx q_*$, or $Q \approx q_\pi$ for a given $\pi$

Input: an arbitrary behavior policy $b$ such that $b(a|s) > 0$, for all $s \in \mathcal{S}, a \in \mathcal{A}$
Initialize $Q(s,a)$ arbitrarily, for all $s \in \mathcal{S}, a \in \mathcal{A}$
Initialize $\pi$ to be $\varepsilon$-greedy with respect to $Q$, or as a fixed given policy
Parameters: step size $\alpha \in (0,1]$, small $\varepsilon > 0$, a positive integer $n$
All store and access operations can take their index mod $n$

Repeat (for each episode):
    Initialize and store $S_0 \ne$ terminal
    Select and store an action $A_0 \sim b(\cdot|S_0)$
    Store $Q(S_0, A_0)$ as $Q_0$
    $T \leftarrow \infty$
    For $t = 0, 1, 2, \ldots$:
        If $t < T$:
            Take action $A_t$
            Observe the next reward $R$; observe and store the next state as $S_{t+1}$
            If $S_{t+1}$ is terminal:
                $T \leftarrow t + 1$
                Store $\delta_t \leftarrow R - Q_t$
            else:
                Select and store an action $A_{t+1} \sim b(\cdot|S_{t+1})$
                Select and store $\sigma_{t+1}$
                Store $Q(S_{t+1}, A_{t+1})$ as $Q_{t+1}$
                Store $R + \gamma \sigma_{t+1} Q_{t+1} + \gamma (1 - \sigma_{t+1}) \sum_a \pi(a|S_{t+1}) Q(S_{t+1}, a) - Q_t$ as $\delta_t$
                Store $\pi(A_{t+1}|S_{t+1})$ as $\pi_{t+1}$
                Store $\frac{\pi(A_{t+1}|S_{t+1})}{b(A_{t+1}|S_{t+1})}$ as $\rho_{t+1}$
        $\tau \leftarrow t - n + 1$ ($\tau$ is the time whose estimate is being updated)
        If $\tau \ge 0$:
            $\rho \leftarrow 1$
            $Z \leftarrow 1$
            $G \leftarrow Q_\tau$
            For $k = \tau, \ldots, \min(\tau + n - 1, T - 1)$:
                $G \leftarrow G + Z \delta_k$
                $Z \leftarrow \gamma Z \left[ (1 - \sigma_{k+1}) \pi_{k+1} + \sigma_{k+1} \right]$
                $\rho \leftarrow \rho (1 - \sigma_k + \sigma_k \rho_k)$
            $Q(S_\tau, A_\tau) \leftarrow Q(S_\tau, A_\tau) + \alpha \rho \left[ G - Q(S_\tau, A_\tau) \right]$
            If $\pi$ is being learned, then ensure that $\pi(a|S_\tau)$ is $\varepsilon$-greedy wrt $Q(S_\tau, \cdot)$
    Until $\tau = T - 1$


7.7 Summary

In this chapter we have developed a range of temporal-difference learning methods that lie in between
the one-step TD methods of the previous chapter and the Monte Carlo methods of the chapter before.
Methods that involve an intermediate amount of bootstrapping are important because they will typically
perform better than either extreme.

Our focus in this chapter has been on n-step methods, which look ahead to 
the next n rewards, states, and actions. The two 4-step backup diagrams to 
the right together summarize most of the methods introduced. The state-value 
update shown is for n-step TD with importance sampling, and the action-value 
update is for n-step $Q(\sigma)$, which generalizes Expected Sarsa and Q-learning.

All n-step methods involve a delay of n time steps before updating, as only 
then are all the required future events known. A further drawback is that they 
involve more computation per time step than previous methods. Compared 
to one-step methods, n-step methods also require more memory to record the 
states, actions, rewards, and sometimes other variables over the last n time 
steps. Eventually, in Chapter 12, we will see how multi-step TD methods can 
be implemented with minimal memory and computational complexity using 
eligibility traces, but there will always be some additional computation beyond 
one-step methods. Such costs can be well worth paying to escape the tyranny 
of the single time step. 

Although n-step methods are more complex than those using eligibility 
traces, they have the great benefit of being conceptually clear. We have sought 
to take advantage of this by developing two approaches to off-policy learning in 
the n-step case. One, based on importance sampling, is conceptually simple but
can be of high variance. If the target and behavior policies are very different
it probably needs some new algorithmic ideas before it can be efficient and
practical. The other, based on tree-backup updates, is the natural extension
of Q-learning to the multi-step case with stochastic target policies. It involves
no importance sampling but, again, if the target and behavior policies are substantially different, the
bootstrapping may span only a few steps even if n is large.


[Figure: backup diagrams for 4-step TD and 4-step $Q(\sigma)$, with steps labeled $\sigma = 1$ and $\sigma = 0$]


Bibliographical and Historical Remarks 

7.1—2 The notion of n-step returns is due to Watkins (1989), who also first discussed their error 
reduction property. n-step algorithms were explored in the first edition of this book, in which
they were treated as of conceptual interest, but not feasible in practice. The work of Cichosz 
(1995) and particularly van Seijen (2016) showed that they are actually completely practical 
algorithms. Given this, and their conceptual clarity and simplicity, we have chosen to highlight 
them here in the second edition. In particular, we now postpone all discussion of the backward 
view and of eligibility traces until Chapter 12. 

The results in the random walk examples were made for this text based on work of Sutton 
(1988) and Singh and Sutton (1996). The use of backup diagrams to describe these and other 
algorithms in this chapter is new. 

7.3—5 The developments in these sections are based on the work of Precup, Sutton, and Singh (2000), 
Precup, Sutton, and Dasgupta (2001), and Sutton, Mahmood, Precup, and van Hasselt (2014). 

The tree-backup algorithm is due to Precup, Sutton, and Singh (2000), but the presentation of
it here is new.

The $Q(\sigma)$ algorithm is new to this text, but has been explored further by De Asis,
Hernandez-Garcia, Holland, and Sutton (2017).



Chapter 8 


Planning and Learning with Tabular Methods


In this chapter we develop a unified view of reinforcement learning methods that require a model of 
the environment, such as dynamic programming and heuristic search, and methods that can be used 
without a model, such as Monte Carlo and temporal-difference methods. These are respectively called 
model-based and model-free reinforcement learning methods. Model-based methods rely on planning as
their primary component, while model-free methods primarily rely on learning. Although there are real 
differences between these two kinds of methods, there are also great similarities. In particular, the heart 
of both kinds of methods is the computation of value functions. Moreover, all the methods are based on 
looking ahead to future events, computing a backed-up value, and then using it as an update target for 
an approximate value function. Earlier in this book we presented Monte Carlo and temporal-difference 
methods as distinct alternatives, then showed how they can be unified by n-step methods. Our goal in 
this chapter is a similar integration of model-based and model-free methods. Having established these 
as distinct in earlier chapters, we now explore the extent to which they can be intermixed. 


8.1 Models and Planning 

By a model of the environment we mean anything that an agent can use to predict how the environment 
will respond to its actions. Given a state and an action, a model produces a prediction of the resultant 
next state and next reward. If the model is stochastic, then there are several possible next states 
and next rewards, each with some probability of occurring. Some models produce a description of all 
possibilities and their probabilities; these we call distribution models. Other models produce just one 
of the possibilities, sampled according to the probabilities; these we call sample models. For example, 
consider modeling the sum of a dozen dice. A distribution model would produce all possible sums 
and their probabilities of occurring, whereas a sample model would produce an individual sum drawn 
according to this probability distribution. The kind of model assumed in dynamic programming
(estimates of the MDP's dynamics, $p(s', r|s, a)$) is a distribution model. The kind of model used in
the blackjack example in Chapter 5 is a sample model. Distribution models are stronger than sample 
models in that they can always be used to produce samples. However, in many applications it is much 
easier to obtain sample models than distribution models. The dozen dice are a simple example of this. 
It would be easy to write a computer program to simulate the dice rolls and return the sum, but harder 
and more error-prone to figure out all the possible sums and their probabilities. 
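
The dice example can be written out directly. The sketch below gives one possible sample model and one possible distribution model for the sum of a dozen dice; the function names are made up for this illustration, and the distribution is built by repeated convolution, which is only one of several ways to compute it.

```python
import random


def sample_model(num_dice=12):
    """Sample model: return one possible sum, drawn by simulating the rolls."""
    return sum(random.randint(1, 6) for _ in range(num_dice))


def distribution_model(num_dice=12):
    """Distribution model: return {sum: probability} for every possible sum."""
    dist = {0: 1.0}
    for _ in range(num_dice):
        new_dist = {}
        # Add one more die: spread each partial sum over the six faces.
        for total, p in dist.items():
            for face in range(1, 7):
                new_dist[total + face] = new_dist.get(total + face, 0.0) + p / 6
        dist = new_dist
    return dist


def sample_from_distribution(dist):
    """A distribution model can always be used to produce samples."""
    sums = list(dist)
    return random.choices(sums, weights=[dist[s] for s in sums])[0]
```

The sample model is a one-liner, while the distribution model needs some bookkeeping, which mirrors the point made above about which kind of model is easier to obtain.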

Models can be used to mimic or simulate experience. Given a starting state and action, a sample
model produces a possible transition, and a distribution model generates all possible transitions weighted
by their probabilities of occurring. Given a starting state and a policy, a sample model could produce
an entire episode, and a distribution model could generate all possible episodes and their probabilities.
In either case, we say the model is used to simulate the environment and produce simulated experience.

The word planning is used in several different ways in different fields. We use the term to refer to any 
computational process that takes a model as input and produces or improves a policy for interacting 
with the modeled environment: 

model --planning--> policy

In artificial intelligence, there are two distinct approaches to planning according to our definition. 
State-space planning, which includes the approach we take in this book, is viewed primarily as a search
through the state space for an optimal policy or an optimal path to a goal. Actions cause transitions
from state to state, and value functions are computed over states. In what we call plan-space planning,
planning is instead a search through the space of plans. Operators transform one plan into another, and 
value functions, if any, are defined over the space of plans. Plan-space planning includes evolutionary 
methods and “partial-order planning,” a common kind of planning in artificial intelligence in which the 
ordering of steps is not completely determined at all stages of planning. Plan-space methods are difficult 
to apply efficiently to the stochastic sequential decision problems that are the focus in reinforcement 
learning, and we do not consider them further (but see, e.g., Russell and Norvig, 2010). 

The unified view we present in this chapter is that all state-space planning methods share a common 
structure, a structure that is also present in the learning methods presented in this book. It takes the 
rest of the chapter to develop this view, but there are two basic ideas: (1) all state-space planning 
methods involve computing value functions as a key intermediate step toward improving the policy, 
and (2) they compute value functions by updates or backup operations applied to simulated experience. 
This common structure can be diagrammed as follows: 


model --> simulated experience --backups--> values --> policy


Dynamic programming methods clearly fit this structure: they make sweeps through the space of 
states, generating for each state the distribution of possible transitions. Each distribution is then used 
to compute a backed-up value (update target) and update the state’s estimated value. In this chapter we 
argue that various other state-space planning methods also fit this structure, with individual methods 
differing only in the kinds of updates they do, the order in which they do them, and in how long the 
backed-up information is retained. 

Viewing planning methods in this way emphasizes their relationship to the learning methods that 
we have described in this book. The heart of both learning and planning methods is the estimation of 
value functions by backing-up update operations. The difference is that whereas planning uses simulated 
experience generated by a model, learning methods use real experience generated by the environment. 
Of course this difference leads to a number of other differences, for example, in how performance is 
assessed and in how flexibly experience can be generated. But the common structure means that many 
ideas and algorithms can be transferred between planning and learning. In particular, in many cases a 
learning algorithm can be substituted for the key update step of a planning method. Learning methods 
require only experience as input, and in many cases they can be applied to simulated experience just 
as well as to real experience. The box below shows a simple example of a planning method based 
on one-step tabular Q-learning and on random samples from a sample model. This method, which 
we call random-sample one-step tabular Q-planning, converges to the optimal policy for the model
under the same conditions that one-step tabular Q-learning converges to the optimal policy for the real
environment (each state-action pair must be selected an infinite number of times in Step 1, and $\alpha$ must
decrease appropriately over time).

In addition to the unified view of planning and learning methods, a second theme in this chapter
is the benefits of planning in small, incremental steps. This enables planning to be interrupted or
redirected at any time with little wasted computation, which appears to be a key requirement for
efficiently intermixing planning with acting and with learning of the model. Planning in very small
steps may be the most efficient approach even on pure planning problems if the problem is too large to
be solved exactly.
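
The random-sample one-step tabular Q-planning method described above might be sketched as follows. The three numbered steps in the comments are an interpretation of the description in the text, and the `sample_model` callable, its `(reward, next_state, done)` return value, and the constant step size are all assumptions made for this example.

```python
import random


def q_planning(sample_model, states, actions, num_updates,
               alpha=0.1, gamma=0.95, seed=0):
    """Random-sample one-step tabular Q-planning (illustrative sketch).

    sample_model(s, a, rng) is a hypothetical callable returning a sampled
    (reward, next_state, done) triple for the modeled environment.
    """
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(num_updates):
        # Step 1: select a state and an action at random.
        s, a = rng.choice(states), rng.choice(actions)
        # Step 2: ask the sample model for a simulated transition.
        r, s_next, done = sample_model(s, a, rng)
        # Step 3: apply the one-step tabular Q-learning update to it.
        if done:
            target = r
        else:
            target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q
```

Because each update touches a single state-action pair, the loop can be stopped after any iteration and still leave a usable table, which is the kind of small, incremental planning step discussed above. A constant step size is used here for simplicity; the convergence conditions mentioned in the text call for one that decreases appropriately.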


8.2 Dyna: Integrating Planning, Acting, and Learning 

When planning is done on-line, while interacting with the environment, a number of interesting issues 
arise. New information gained from the interaction may change the model and thereby interact with 
planning. It may be desirable to customize the planning process in some way to the states or decisions 
currently under consideration, or expected in the near future. If decision making and model learning 
are both computation-intensive processes, then the available computational resources may need to be 
divided between them. To begin exploring these issues, in this section we present Dyna-Q, a simple 
architecture integrating the major functions needed in an on-line planning agent. Each function appears 
in Dyna-Q in a simple, almost trivial, form. In subsequent sections we elaborate some of the alternate 
ways of achieving each function and the trade-offs between them. For now, we seek merely to illustrate 
the ideas and stimulate your intuition. 

Within a planning agent, there are at least two roles for real experience: it can be used to improve the 
model (to make it more accurately match the real environment) and it can be used to directly improve 
the value function and policy using the kinds of reinforcement learning methods we have discussed in 
previous chapters. The former we call model-learning, and the latter we call direct reinforcement learning 
(direct RL). The possible relationships between experience, model, values, and policy are summarized 
in Figure 8.1. Each arrow shows a relationship of influence and presumed improvement. Note how 
experience can improve value functions and policies either directly or indirectly via the model. It is the 
latter, which is sometimes called indirect reinforcement learning , that is involved in planning. 




Figure 8.1: Relationships among learning, planning, and acting. 

Both direct and indirect methods have advantages and disadvantages. Indirect methods often make 
fuller use of a limited amount of experience and thus achieve a better policy with fewer environmental 
interactions. On the other hand, direct methods are much simpler and are not affected by biases in 
the design of the model. Some have argued that indirect methods are always superior to direct ones, 
while others have argued that direct methods are responsible for most human and animal learning. 
Related debates in psychology and artificial intelligence concern the relative importance of cognition as 
opposed to trial-and-error learning, and of deliberative planning as opposed to reactive decision making 
(see Chapter 14 for discussion of some of these issues from the perspective of psychology). Our view is 
that the contrast between the alternatives in all these debates has been exaggerated, that more insight 
can be gained by recognizing the similarities between these two sides than by opposing them. For 
example, in this book we have emphasized the deep similarities between dynamic programming and 
temporal-difference methods, even though one was designed for planning and the other for model-free 
learning. 

Dyna-Q includes all of the processes shown in Figure 8.1 (planning, acting, model-learning, and 
direct RL), all occurring continuously. The planning method is the random-sample one-step tabular 
Q-planning method introduced in the previous section. The direct RL method is one-step tabular Q-learning. The 
model-learning method is also table-based and assumes the environment is deterministic. After each 
transition $S_t, A_t \to R_{t+1}, S_{t+1}$, the model records in its table entry for $S_t, A_t$ the prediction that 
$R_{t+1}, S_{t+1}$ will deterministically follow. Thus, if the model is queried with a state-action pair that has 
been experienced before, it simply returns the last-observed next state and next reward as its prediction. 
During planning, the Q-planning algorithm randomly samples only from state-action pairs that have 
previously been experienced (in Step 1), so the model is never queried with a pair about which it has 
no information. 

The overall architecture of Dyna agents, of which the Dyna-Q algorithm is one example, is shown in 
Figure 8.2. The central column represents the basic interaction between agent and environment, giving 
rise to a trajectory of real experience. The arrow on the left of the figure represents direct reinforcement 
learning operating on real experience to improve the value function and the policy. On the right are 
model-based processes. The model is learned from real experience and gives rise to simulated experience. 
We use the term search control to refer to the process that selects the starting states and actions for the 
simulated experiences generated by the model. Finally, planning is achieved by applying reinforcement 
learning methods to the simulated experiences just as if they had really happened. Typically, as in 
Dyna-Q, the same reinforcement learning method is used both for learning from real experience and 
for planning from simulated experience. The reinforcement learning method is thus the “final common 
path” for both learning and planning. Learning and planning are deeply integrated in the sense that 
they share almost all the same machinery, differing only in the source of their experience. 

Figure 8.2: The general Dyna Architecture. Real experience, passing back and forth between the environment 
and the policy, affects policy and value functions in much the same way as does simulated experience generated 
by the model of the environment. 

Conceptually, planning, acting, model-learning, and direct RL occur simultaneously and in parallel 
in Dyna agents. For concreteness and implementation on a serial computer, however, we fully specify 
the order in which they occur within a time step. In Dyna-Q, the acting, model-learning, and direct 
RL processes require little computation, and we assume they consume just a fraction of the time. The 
remaining time in each step can be devoted to the planning process, which is inherently computation-intensive. 
Let us assume that there is time in each step, after acting, model-learning, and direct RL, 
to complete $n$ iterations (Steps 1-3) of the Q-planning algorithm. In the pseudocode algorithm for 
Dyna-Q in the box below, $Model(s, a)$ denotes the contents of the model (predicted next state and reward) for 
state-action pair $(s, a)$. Direct reinforcement learning, model-learning, and planning are implemented 
by steps (d), (e), and (f), respectively. If (e) and (f) were omitted, the remaining algorithm would be 
one-step tabular Q-learning. 


Tabular Dyna-Q 

Initialize $Q(s, a)$ and $Model(s, a)$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$ 

Do forever: 

(a) $S \leftarrow$ current (nonterminal) state 

(b) $A \leftarrow \varepsilon$-greedy$(S, Q)$ 

(c) Execute action $A$; observe resultant reward, $R$, and state, $S'$ 

(d) $Q(S, A) \leftarrow Q(S, A) + \alpha [R + \gamma \max_a Q(S', a) - Q(S, A)]$ 

(e) $Model(S, A) \leftarrow R, S'$ (assuming deterministic environment) 

(f) Repeat $n$ times: 

    $S \leftarrow$ random previously observed state 
    $A \leftarrow$ random action previously taken in $S$ 
    $R, S' \leftarrow Model(S, A)$ 
    $Q(S, A) \leftarrow Q(S, A) + \alpha [R + \gamma \max_a Q(S', a) - Q(S, A)]$ 
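
The steps of the box can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the small 6 x 4 maze, the names `maze_step` and `dyna_q`, and the extra `done` flag stored in the model (to handle the terminal goal state) are all choices made for this example.

```python
import random
from collections import defaultdict

# Hypothetical 6x4 maze used only for illustration (not the maze of Example 8.1).
WIDTH, HEIGHT = 6, 4
WALLS = {(2, 1), (2, 2), (2, 3)}
START, GOAL = (0, 3), (5, 3)
ACTIONS = [(0, -1), (0, 1), (1, 0), (-1, 0)]  # up, down, right, left

def maze_step(state, action):
    """Deterministic transition; a blocked move leaves the agent where it is."""
    x, y = state[0] + action[0], state[1] + action[1]
    if not (0 <= x < WIDTH and 0 <= y < HEIGHT) or (x, y) in WALLS:
        x, y = state
    done = (x, y) == GOAL
    return (1.0 if done else 0.0), (x, y), done

def dyna_q(n_planning, episodes, alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)   # Q[(state, action)], initialised to zero
    model = {}               # (state, action) -> (reward, next_state, done)
    taken = {}               # state -> actions tried there (search control)
    seen_states = []
    steps_per_episode = []

    def q_update(s, a, r, s2, done):
        best_next = 0.0 if done else max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    for _ in range(episodes):
        s, done, steps = START, False, 0
        while not done:
            # (b) epsilon-greedy selection with random tie-breaking
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                best = max(Q[(s, b)] for b in ACTIONS)
                a = rng.choice([b for b in ACTIONS if Q[(s, b)] == best])
            r, s2, done = maze_step(s, a)      # (c) act in the real environment
            q_update(s, a, r, s2, done)        # (d) direct RL
            if s not in taken:                 # (e) model-learning
                taken[s] = []
                seen_states.append(s)
            if a not in taken[s]:
                taken[s].append(a)
            model[(s, a)] = (r, s2, done)
            for _ in range(n_planning):        # (f) planning on simulated experience
                ps = rng.choice(seen_states)
                pa = rng.choice(taken[ps])
                q_update(ps, pa, *model[(ps, pa)])
            s, steps = s2, steps + 1
        steps_per_episode.append(steps)
    return Q, steps_per_episode

for n in (0, 5, 50):
    _, steps = dyna_q(n_planning=n, episodes=20)
    print(f"n={n:2d}: steps in episodes 2-6 = {steps[1:6]}")
```

Note that `q_update` is shared by steps (d) and (f); the two differ only in whether the transition comes from `maze_step` or from `model`.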


Example 8.1: Dyna Maze Consider the simple maze shown inset in Figure 8.3. In each of the 47 
states there are four actions, up, down, right, and left, which take the agent deterministically to the 
corresponding neighboring states, except when movement is blocked by an obstacle or the edge of the 
maze, in which case the agent remains where it is. Reward is zero on all transitions, except those into 
the goal state, on which it is +1. After reaching the goal state (G), the agent returns to the start state 
(S) to begin a new episode. This is a discounted, episodic task with $\gamma = 0.95$. 

The main part of Figure 8.3 shows average learning curves from an experiment in which Dyna-Q 
agents were applied to the maze task. The initial action values were zero, the step-size parameter was 
$\alpha = 0.1$, and the exploration parameter was $\varepsilon = 0.1$. When selecting greedily among actions, ties were 
broken randomly. The agents varied in the number of planning steps, n, they performed per real step. 
For each n, the curves show the number of steps taken by the agent to reach the goal in each episode, 
averaged over 30 repetitions of the experiment. In each repetition, the initial seed for the random 
number generator was held constant across algorithms. Because of this, the first episode was exactly 
the same (about 1700 steps) for all values of n, and its data are not shown in the figure. After the first 
episode, performance improved for all values of n, but much more rapidly for larger values. Recall that 
the n = 0 agent is a nonplanning agent, using only direct reinforcement learning (one-step tabular 
Q-learning). This was by far the slowest agent on this problem, despite the fact that the parameter values 
($\alpha$ and $\varepsilon$) were optimized for it. The nonplanning agent took about 25 episodes to reach ($\varepsilon$-)optimal 
performance, whereas the n = 5 agent took about five episodes, and the n = 50 agent took only three 
episodes. 

Figure 8.3: A simple maze (inset) and the average learning curves for Dyna-Q agents varying in their number 
of planning steps (n) per real step. The task is to travel from S to G as quickly as possible. 

Figure 8.4 shows why the planning agents found the solution so much faster than the nonplanning 
agent. Shown are the policies found by the n = 0 and n = 50 agents halfway through the second episode. 
Without planning (n = 0), each episode adds only one additional step to the policy, and so only one 
step (the last) has been learned so far. With planning, again only one step is learned during the first 
episode, but here during the second episode an extensive policy has been developed that by the end of 
the episode will reach almost back to the start state. This policy is built by the planning process while 
the agent is still wandering near the start state. By the end of the third episode a complete optimal 
policy will have been found and perfect performance attained. 

Figure 8.4: Policies found by planning and nonplanning Dyna-Q agents halfway through the second episode. 
The arrows indicate the greedy action in each state; if no arrow is shown for a state, then all of its action values 
were equal. The black square indicates the location of the agent. ■ 

In Dyna-Q, learning and planning are accomplished by exactly the same algorithm, operating on 
real experience for learning and on simulated experience for planning. Because planning proceeds 
incrementally, it is trivial to intermix planning and acting. Both proceed as fast as they can. The agent 
is always reactive and always deliberative, responding instantly to the latest sensory information and 
yet always planning in the background. Also ongoing in the background is the model-learning process. 
As new information is gained, the model is updated to better match reality. As the model changes, the 
ongoing planning process will gradually compute a different way of behaving to match the new model. 

Exercise 8.1 The nonplanning method looks particularly poor in Figure 8.4 because it is a one-step 
method; a method using multi-step bootstrapping would do better. Do you think one of the multi-step 
bootstrapping methods from Chapter 7 could do as well as the Dyna method? Explain why or why not. 


8.3 When the Model Is Wrong 

In the maze example presented in the previous section, the changes in the model were relatively modest. 
The model started out empty, and was then filled only with exactly correct information. In general, we 
cannot expect to be so fortunate. Models may be incorrect because the environment is stochastic and 
only a limited number of samples have been observed, or because the model was learned using function 
approximation that has generalized imperfectly, or simply because the environment has changed and 
its new behavior has not yet been observed. When the model is incorrect, the planning process is likely 
to compute a suboptimal policy. 

In some cases, the suboptimal policy computed by planning quickly leads to the discovery and 
correction of the modeling error. This tends to happen when the model is optimistic in the sense of 
predicting greater reward or better state transitions than are actually possible. The planned policy 
attempts to exploit these opportunities and in doing so discovers that they do not exist. 

Example 8.2: Blocking Maze A maze example illustrating this relatively minor kind of modeling 
error and recovery from it is shown in Figure 8.5. Initially, there is a short path from start to goal, to 
the right of the barrier, as shown in the upper left of the figure. After 1000 time steps, the short path 
is “blocked,” and a longer path is opened up along the left-hand side of the barrier, as shown in the upper 
right of the figure. The graph shows average cumulative reward for a Dyna-Q agent and an enhanced 
Dyna-Q+ agent to be described shortly. The first part of the graph shows that both Dyna agents found 
the short path within 1000 steps. When the environment changed, the graphs became flat, indicating 
a period during which the agents obtained no reward because they were wandering around behind the 
barrier. After a while, however, they were able to find the new opening and the new optimal behavior. 




Figure 8.5: Average performance of Dyna agents on a blocking task. The left environment was used for the first 
1000 steps, the right environment for the rest. Dyna-Q+ is Dyna-Q with an exploration bonus that encourages 
exploration. ■ 

Greater difficulties arise when the environment changes to become better than it was before, and yet 
the formerly correct policy does not reveal the improvement. In these cases the modeling error may not 
be detected for a long time, if ever, as we see in the next example. 

Example 8.3: Shortcut Maze The problem caused by this kind of environmental change is illustrated 
by the maze example shown in Figure 8.6. Initially, the optimal path is to go around the left 
side of the barrier (upper left). After 3000 steps, however, a shorter path is opened up along the right 
side, without disturbing the longer path (upper right). The graph shows that the regular Dyna-Q agent 
never switched to the shortcut. In fact, it never realized that it existed. Its model said that there was 
no shortcut, so the more it planned, the less likely it was to step to the right and discover it. Even with 
an $\varepsilon$-greedy policy, it is very unlikely that an agent will take so many exploratory actions as to discover 
the shortcut. 



Figure 8.6: Average performance of Dyna agents on a shortcut task. The left environment was used for the 
first 3000 steps, the right environment for the rest. ■ 


The general problem here is another version of the conflict between exploration and exploitation. 
In a planning context, exploration means trying actions that improve the model, whereas exploitation 
means behaving in the optimal way given the current model. We want the agent to explore to find 
changes in the environment, but not so much that performance is greatly degraded. As in the earlier 
exploration/exploitation conflict, there probably is no solution that is both perfect and practical, but 
simple heuristics are often effective. 

The Dyna-Q+ agent that did solve the shortcut maze uses one such heuristic. This agent keeps track 
for each state-action pair of how many time steps have elapsed since the pair was last tried in a real 
interaction with the environment. The more time that has elapsed, the greater (we might presume) the 
chance that the dynamics of this pair has changed and that the model of it is incorrect. To encourage 
behavior that tests long-untried actions, a special “bonus reward” is given on simulated experiences 
involving these actions. In particular, if the modeled reward for a transition is $r$, and the transition has 
not been tried in $\tau$ time steps, then planning updates are done as if that transition produced a reward 
of $r + \kappa\sqrt{\tau}$, for some small $\kappa$. This encourages the agent to keep testing all accessible state transitions 
and even to find long sequences of actions in order to carry out such tests.¹ Of course all this testing 
has its cost, but in many cases, as in the shortcut maze, this kind of computational curiosity is well 
worth the extra exploration. 

¹ The Dyna-Q+ agent was changed in two other ways as well. First, actions that had never been tried before from a 
state were allowed to be considered in the planning step (f) of the Tabular Dyna-Q algorithm in the box above. Second, 
the initial model for such actions was that they would lead back to the same state with a reward of zero. 
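
The exploration bonus $r + \kappa\sqrt{\tau}$ can be sketched as a small change to the planning update. The class `LastTried`, the function `planning_update`, and the choice to measure $\tau$ from time zero for pairs that have never been tried are assumptions of this example; the default model entry for untried pairs (same state, zero reward) follows the footnote above.

```python
import math

class LastTried:
    """Tracks how many real time steps have passed since each pair was tried."""
    def __init__(self):
        self.t = 0
        self.last = {}  # (state, action) -> time step of the most recent real try

    def record(self, s, a):
        self.t += 1
        self.last[(s, a)] = self.t

    def tau(self, s, a):
        # Assumption: a never-tried pair counts as untried since time zero.
        return self.t - self.last.get((s, a), 0)

def planning_update(Q, model, tracker, s, a, actions,
                    alpha=0.1, gamma=0.95, kappa=1e-3):
    """One simulated Q-learning update using the bonus reward r + kappa*sqrt(tau)."""
    # Untried pairs are modeled as returning to the same state with zero reward.
    r, s2 = model.get((s, a), (0.0, s))
    r_bonus = r + kappa * math.sqrt(tracker.tau(s, a))
    best_next = max(Q.get((s2, b), 0.0) for b in actions)
    q = Q.get((s, a), 0.0)
    Q[(s, a)] = q + alpha * (r_bonus + gamma * best_next - q)

# Usage: "left" was tried once long ago, "right" has been tried ever since.
tracker = LastTried()
tracker.record("s0", "left")
for _ in range(100):
    tracker.record("s0", "right")
model = {("s0", "left"): (0.0, "s0"), ("s0", "right"): (0.0, "s0")}
Q = {}
planning_update(Q, model, tracker, "s0", "left", ["left", "right"])
print(tracker.tau("s0", "left"), Q[("s0", "left")])
```

Here the long-untried action receives a bonus of $0.001 \cdot \sqrt{100} = 0.01$, so its planned value rises even though its modeled reward is zero.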

Exercise 8.2 Why did the Dyna agent with exploration bonus, Dyna-Q+, perform better in the first 
phase as well as in the second phase of the blocking and shortcut experiments? 

Exercise 8.3 Careful inspection of Figure 8.6 reveals that the difference between Dyna-Q+ and Dyna-Q 
narrowed slightly over the first part of the experiment. What is the reason for this? 

Exercise 8.4 (programming) The exploration bonus described above actually changes the estimated 
values of states and actions. Is this necessary? Suppose the bonus $\kappa\sqrt{\tau}$ was used not in updates, but 
solely in action selection. That is, suppose the action selected was always that for which 
$Q(S_t, a) + \kappa\sqrt{\tau(S_t, a)}$ was maximal. Carry out a gridworld experiment that tests and illustrates the strengths and 
weaknesses of this alternate approach. 


8.4 Prioritized Sweeping 

In the Dyna agents presented in the preceding sections, simulated transitions are started in state-action 
pairs selected uniformly at random from all previously experienced pairs. But a uniform selection is 
usually not the best; planning can be much more efficient if simulated transitions and updates are 
focused on particular state-action pairs. For example, consider what happens during the second episode 
of the first maze task (Figure 8.4). At the beginning of the second episode, only the state-action pair 
leading directly into the goal has a positive value; the values of all other pairs are still zero. This means 
that it is pointless to perform updates along almost all transitions, because they take the agent from 
one zero-valued state to another, and thus the updates would have no effect. Only an update along a 
transition into the state just prior to the goal, or from it, will change any values. If simulated transitions 
are generated uniformly, then many wasteful updates will be made before stumbling onto one of these 
useful ones. As planning progresses, the region of useful updates grows, but planning is still far less 
efficient than it would be if focused where it would do the most good. In the much larger problems that 
are our real objective, the number of states is so large that an unfocused search would be extremely 
inefficient. 

This example suggests that search might be usefully focused by working backward from goal states. 
Of course, we do not really want to use any methods specific to the idea of “goal state.” We want 
methods that work for general reward functions. Goal states are just a special case, convenient for 
stimulating intuition. In general, we want to work back not just from goal states but from any state 
whose value has changed. Suppose that the values are initially correct given the model, as they were in 
the maze example prior to discovering the goal. Suppose now that the agent discovers a change in the 
environment and changes its estimated value of one state, either up or down. Typically, this will imply 
that the values of many other states should also be changed, but the only useful one-step updates are 
those of actions that lead directly into the one state whose value has been changed. If the values of 
these actions are updated, then the values of the predecessor states may change in turn. If so, then 
actions leading into them need to be updated, and then their predecessor states may have changed. In 
this way one can work backward from arbitrary states that have changed in value, either performing 
useful updates or terminating the propagation. This general idea might be termed backward focusing 
of planning computations. 

As the frontier of useful updates propagates backward, it often grows rapidly, producing many state- 
action pairs that could usefully be updated. But not all of these will be equally useful. The values of 
some states may have changed a lot, whereas others may have changed little. The predecessor pairs 
of those that have changed a lot are more likely to also change a lot. In a stochastic environment, 
variations in estimated transition probabilities also contribute to variations in the sizes of changes 
and in the urgency with which pairs need to be updated. It is natural to prioritize the updates 
according to a measure of their urgency, and perform them in order of priority. This is the idea behind 
prioritized sweeping. A queue is maintained of every state-action pair whose estimated value would 
change nontrivially if updated, prioritized by the size of the change. When the top pair in the queue is 
updated, the effect on each of its predecessor pairs is computed. If the effect is greater than some small 
threshold, then the pair is inserted in the queue with the new priority (if there is a previous entry of the 
pair in the queue, then insertion results in only the higher priority entry remaining in the queue). In 
this way the effects of changes are efficiently propagated backward until quiescence. The full algorithm 
for the case of deterministic environments is given in the box. 
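
The queue-driven propagation described in this paragraph might be sketched as below for a deterministic model. This is one possible realization, not the algorithm as printed: the `PriorityQueue` class (a heap with lazy deletion of stale entries), the function `planning_sweep`, the threshold name `theta`, and the four-state chain are all assumptions of this example, and only the planning part of the agent is shown.

```python
import heapq

class PriorityQueue:
    """Max-priority queue keeping only the highest-priority entry per pair."""
    def __init__(self):
        self.heap = []   # entries (-priority, pair); may contain stale entries
        self.best = {}   # pair -> current (highest) priority

    def push(self, pair, priority):
        if priority > self.best.get(pair, 0.0):
            self.best[pair] = priority
            heapq.heappush(self.heap, (-priority, pair))

    def pop(self):
        while self.heap:
            neg_p, pair = heapq.heappop(self.heap)
            if self.best.get(pair) == -neg_p:   # skip superseded entries
                del self.best[pair]
                return pair
        return None

    def __bool__(self):
        return bool(self.best)

def planning_sweep(Q, model, predecessors, queue, actions, n,
                   alpha=1.0, gamma=0.95, theta=1e-4):
    """Perform up to n prioritized updates; returns the number performed."""
    def max_q(s):
        return max(Q.get((s, b), 0.0) for b in actions)

    updates = 0
    while updates < n and queue:
        s, a = queue.pop()
        r, s2 = model[(s, a)]
        q = Q.get((s, a), 0.0)
        Q[(s, a)] = q + alpha * (r + gamma * max_q(s2) - q)
        updates += 1
        # Compute the effect of this change on every pair predicted to lead to s.
        for ps, pa in predecessors.get(s, ()):
            pr, _ = model[(ps, pa)]
            priority = abs(pr + gamma * max_q(s) - Q.get((ps, pa), 0.0))
            if priority > theta:
                queue.push((ps, pa), priority)
    return updates

# Hypothetical chain 0 -> 1 -> 2 -> 3 -> 4 (terminal): action 0 moves right,
# action 1 stays put; only the transition into state 4 is rewarded.
model = {}
for s in range(4):
    model[(s, 0)] = (1.0 if s == 3 else 0.0, s + 1)
    model[(s, 1)] = (0.0, s)
predecessors = {}
for (s, a), (_, s2) in model.items():
    predecessors.setdefault(s2, []).append((s, a))

Q = {}
queue = PriorityQueue()
queue.push((3, 0), 1.0)  # the rewarding transition has just been discovered
n_updates = planning_sweep(Q, model, predecessors, queue, actions=(0, 1), n=100)
print(n_updates, Q[(0, 0)])
```

In this toy chain each of the eight state-action pairs is updated exactly once before the queue empties, whereas uniformly sampled updates would spend most of their effort on pairs whose values cannot yet change.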

Example 8.5 Rod Maneuvering 


The objective in this task is to maneuver a rod around some awkwardly placed obstacles within 
a limited rectangular work space to a goal position in the fewest number of steps. The rod can 
be translated along its long axis or perpendicular to that axis, or it can be rotated in either 
direction around its center. The distance of each movement is approximately 1/20 of the work 
space, and the rotation increment is 10 degrees. Translations are deterministic and quantized to 
one of 20 × 20 positions. The figure below shows the obstacles and the shortest solution from 
start to goal, found by prioritized sweeping. 



This problem is deterministic, but has four actions and 14,400 potential states (some of these 
are unreachable because of the obstacles). This problem is probably too large to be solved with 
unprioritized methods. Figure reprinted from Moore and Atkeson (1993). ■ 


Extensions of prioritized sweeping to stochastic environments are straightforward. The model is 
maintained by keeping counts of the number of times each state-action pair has been experienced and 
of what the next states were. It is natural then to update each pair not with a sample update, as we 
have been using so far, but with an expected update, taking into account all possible next states and 
their probabilities of occurring. 
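
A count-based model of the kind just described could look like the following sketch. The class name `CountModel` and its methods are hypothetical; the point is only that relative frequencies of observed outcomes give the estimated probabilities that an expected update needs.

```python
from collections import defaultdict

class CountModel:
    """Tabular model for stochastic environments, built from outcome counts."""
    def __init__(self):
        # counts[(s, a)][(r, s2)] = number of times (r, s2) followed (s, a)
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, s, a, r, s2):
        self.counts[(s, a)][(r, s2)] += 1

    def distribution(self, s, a):
        """Return a list of (estimated probability, reward, next state)."""
        outcomes = self.counts[(s, a)]
        total = sum(outcomes.values())
        return [(n / total, r, s2) for (r, s2), n in outcomes.items()]

# Usage: three of four observed transitions from ("s", "a") led to "x".
m = CountModel()
for r, s2 in [(0.0, "x"), (0.0, "x"), (1.0, "y"), (0.0, "x")]:
    m.observe("s", "a", r, s2)
dist = m.distribution("s", "a")
print(sorted(dist))
```

The list returned by `distribution` enumerates every observed next state with its estimated probability, which is exactly the information an expected update sums over.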

Prioritized sweeping is just one way of distributing computations to improve planning efficiency, and 
probably not the best way. One of prioritized sweeping’s limitations is that it uses expected updates, 
which in stochastic environments may waste lots of computation on low-probability transitions. As we 
show in the following section, sample updates can in many cases get closer to the true value function 
with less computation despite the variance introduced by sampling. Sample updates can win because 
they break the overall backing-up computation into smaller pieces—those corresponding to individual 
transitions—which then enables it to be focused more narrowly on the pieces that will have the largest 
impact. This idea was taken to what may be its logical limit in the “small backups” introduced by 
van Seijen and Sutton (2013). These are updates along a single transition, like a sample update, but 
based on the probability of the transition without sampling, as in an expected update. By selecting the 
order in which small updates are done it is possible to greatly improve planning efficiency beyond that 
possible with prioritized sweeping. 

We have suggested in this chapter that all kinds of state-space planning can be viewed as sequences 
of value updates, varying only in the type of update, expected or sample, large or small, and in the 
order in which the updates are done. In this section we have emphasized backward focusing, but this 
is just one strategy. For example, another would be to focus on states according to how easily they 
can be reached from the states that are visited frequently under the current policy, which might be 
called forward focusing. Peng and Williams (1993) and Barto, Bradtke and Singh (1995) have explored 
versions of forward focusing, and the methods introduced in the next few sections take it to an extreme 
form. 


8.5 Expected vs. Sample Updates 

The examples in the previous sections give some idea of the range of possibilities for combining methods 
of learning and planning. In the rest of this chapter, we analyze some of the component ideas involved, 
starting with the relative advantages of expected and sample updates. 

Much of this book has been about different kinds of value-function updates, and we have considered 
a great many varieties. Focusing for the moment on one-step updates, they vary primarily along three 
binary dimensions. The first two dimensions are whether they update state values or action values 
and whether they estimate the value for the optimal policy or for an arbitrary given policy. These two 
dimensions give rise to four classes of updates for approximating the four value functions, $q_*$, $v_*$, $q_\pi$, and 
$v_\pi$. The other binary dimension is whether the updates are expected updates, considering all possible 
events that might happen, or sample updates, considering a single sample of what might happen. These 
three binary dimensions give rise to eight cases, seven of which correspond to specific algorithms, as 
shown in Figure 8.7. (The eighth case does not seem to correspond to any useful update.) Any of these 
one-step updates can be used in planning methods. The Dyna-Q agents discussed earlier use $q_*$ sample 
updates, but they could just as well use $q_*$ expected updates, or either expected or sample $q_\pi$ updates. 
The Dyna-AC system uses $v_\pi$ sample updates together with a learning policy structure. For stochastic 
problems, prioritized sweeping is always done using one of the expected updates. 

Figure 8.7: Backup diagrams for all the one-step updates considered in this book. 

When we introduced one-step sample updates in Chapter 6, we presented them as substitutes for 
expected updates. In the absence of a distribution model, expected updates are not possible, but sample 
updates can be done using sample transitions from the environment or a sample model. Implicit in that 
point of view is that expected updates, if possible, are preferable to sample updates. But are they? 
Expected updates certainly yield a better estimate because they are uncorrupted by sampling error, 
but they also require more computation, and computation is often the limiting resource in planning. 
To properly assess the relative merits of expected and sample updates for planning we must control for 
their different computational requirements. 

For concreteness, consider the expected and sample updates for approximating $q_*$, and the special 
case of discrete states and actions, a table-lookup representation of the approximate value function, $Q$, 
and a model in the form of estimated dynamics, $p(s', r \mid s, a)$. The expected update for a state-action 
pair, $s, a$, is: 

$$Q(s, a) \leftarrow \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \max_{a'} Q(s', a') \right]. \tag{8.1}$$

The corresponding sample update for $s, a$, given a sample next state and reward, $S'$ and $R$ (from the 
model), is the Q-learning-like update: 

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R + \gamma \max_{a'} Q(S', a') - Q(s, a) \right], \tag{8.2}$$

where $\alpha$ is the usual positive step-size parameter. 
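
The two updates can be written side by side as in the sketch below, which may help make the cost difference concrete: the expected update loops over every possible outcome, while the sample update touches a single one. The function names and the representation of the model as a list of `(probability, reward, next_state)` triples are assumptions of this example.

```python
def max_q(Q, s, actions):
    """Largest action value at s; unseen entries default to zero."""
    return max(Q.get((s, a), 0.0) for a in actions)

def expected_update(Q, p, s, a, actions, gamma):
    """Expected update (8.1): p[(s, a)] lists (probability, reward, next_state)."""
    Q[(s, a)] = sum(prob * (r + gamma * max_q(Q, s2, actions))
                    for prob, r, s2 in p[(s, a)])
    return Q[(s, a)]

def sample_update(Q, s, a, r, s2, actions, alpha, gamma):
    """Sample update (8.2) from one sampled reward r and next state s2."""
    q = Q.get((s, a), 0.0)
    Q[(s, a)] = q + alpha * (r + gamma * max_q(Q, s2, actions) - q)
    return Q[(s, a)]

actions = ["go"]

# Deterministic case: with alpha = 1 the two updates coincide.
p_det = {("s", "go"): [(1.0, 2.0, "t")]}
q_exp = expected_update({("t", "go"): 4.0}, p_det, "s", "go", actions, gamma=0.95)
q_smp = sample_update({("t", "go"): 4.0}, "s", "go", 2.0, "t", actions,
                      alpha=1.0, gamma=0.95)
print(q_exp, q_smp)

# Stochastic case: two equally likely outcomes with rewards 0 and 1.
p_sto = {("s", "go"): [(0.5, 0.0, "x"), (0.5, 1.0, "y")]}
q_exp_sto = expected_update({}, p_sto, "s", "go", actions, gamma=0.95)

# Sample averaging (alpha = 1/t) over a fixed, hand-picked sequence of outcomes.
Q = {}
for t, (r, s2) in enumerate([(0.0, "x"), (1.0, "y"), (1.0, "y"), (0.0, "x")], start=1):
    q_avg = sample_update(Q, "s", "go", r, s2, actions, alpha=1.0 / t, gamma=0.95)
print(q_exp_sto, q_avg)
```

In the stochastic case the expected update reaches 0.5 in one step at the cost of visiting both outcomes, while the sample updates approach it one outcome at a time.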

The difference between these expected and sample updates is significant to the extent that the 
environment is stochastic, specifically, to the extent that, given a state and action, many possible next 
states may occur with various probabilities. If only one next state is possible, then the expected and 
sample updates given above are identical (taking $\alpha = 1$). If there are many possible next states, then 
there may be significant differences. In favor of the expected update is that it is an exact computation, 
resulting in a new Q(s,a) whose correctness is limited only by the correctness of the Q(s', a') at successor 
states. The sample update is in addition affected by sampling error. On the other hand, the sample 
update is cheaper computationally because it considers only one next state, not all possible next states. 
In practice, the computation required by update operations is usually dominated by the number of 
state-action pairs at which $Q$ is evaluated. For a particular starting pair, $s, a$, let $b$ be the branching 
factor (i.e., the number of possible next states, $s'$, for which $p(s' \mid s, a) > 0$). Then an expected update 
of this pair requires roughly b times as much computation as a sample update. 

If there is enough time to complete an expected update, then the resulting estimate is generally better 
than that of b sample updates because of the absence of sampling error. But if there is insufficient time 
to complete an expected update, then sample updates are always preferable because they at least make 
some improvement in the value estimate with fewer than b updates. In a large problem with many 
state-action pairs, we are often in the latter situation. With so many state-action pairs, expected 
updates of all of them would take a very long time. Before that we may be much better off with a few 
sample updates at many state-action pairs than with expected updates at a few pairs. Given a unit 
of computational effort, is it better devoted to a few expected updates or to b times as many sample 
updates? 

Figure 8.8: Comparison of efficiency of expected and sample updates. The horizontal axis is the number of $\max_{a'} Q(s', a')$ computations. 

Figure 8.8 shows the results of an analysis that suggests an answer to this question. It shows the 
estimation error as a function of computation time for expected and sample updates for a variety of 
branching factors, b. The case considered is that in which all b successor states are equally likely and 
in which the error in the initial estimate is 1. The values at the next states are assumed correct, so the 
expected update reduces the error to zero upon its completion. In this case, sample updates reduce the 
error according to $\sqrt{\frac{b-1}{bt}}$, where $t$ is the number of sample updates that have been performed (assuming 
sample averages, i.e., $\alpha = 1/t$). The key observation is that for moderately large $b$ the error falls 
dramatically with a tiny fraction of b updates. For these cases, many state-action pairs could have 
their values improved dramatically, to within a few percent of the effect of an expected update, in the 
same time that a single state-action pair could undergo an expected update. 
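
To get a feel for the numbers, the short sketch below evaluates the error of sample updates under the assumptions of this analysis, taking the error after $t$ sample updates to be $\sqrt{(b-1)/(bt)}$ with an initial error of 1. The function name `sample_update_error` and the choice to report the error after $b/10$ updates are illustrative only.

```python
import math

def sample_update_error(b, t):
    """Error after t sample updates with branching factor b (initial error 1)."""
    return math.sqrt((b - 1) / (b * t))

for b in (2, 10, 100, 1000, 10000):
    t = max(1, b // 10)  # one tenth of the cost of a single expected update
    print(f"b={b:5d}  t={t:4d}  error={sample_update_error(b, t):.4f}")
```

For $b = 10{,}000$, a tenth of the computation needed for one expected update already brings the error down to roughly 0.03.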

The advantage of sample updates shown in Figure 8.8 is probably an underestimate of the real effect. 
In a real problem, the values of the successor states would be estimates that are themselves updated. 
By causing estimates to be more accurate sooner, sample updates will have a second advantage in that 
the values backed up from the successor states will be more accurate. These results suggest that sample 
updates are likely to be superior to expected updates on problems with large stochastic branching 
factors and too many states to be solved exactly. 


8.6 Trajectory Sampling 

In this section we compare two ways of distributing updates. The classical approach, from dynamic 
programming, is to perform sweeps through the entire state (or state-action) space, updating each state 
(or state-action pair) once per sweep. This is problematic on large tasks because there may not be 
time to complete even one sweep. In many tasks the vast majority of the states are irrelevant because 
they are visited only under very poor policies or with very low probability. Exhaustive sweeps implicitly 
devote equal time to all parts of the state space rather than focusing where it is needed. As we discussed 
in Chapter 4, exhaustive sweeps and the equal treatment of all states that they imply are not necessary 
properties of dynamic programming. In principle, updates can be distributed any way one likes (to 
assure convergence, all states or state-action pairs must be visited in the limit an infinite number of 
times; although an exception to this is discussed in Section 8.7 below), but in practice exhaustive sweeps 
are often used. 

The second approach is to sample from the state or state-action space according to some distribution. 
One could sample uniformly, as in the Dyna-Q agent, but this would suffer from some of the same 
problems as exhaustive sweeps. More appealing is to distribute updates according to the on-policy 
distribution, that is, according to the distribution observed when following the current policy. One
advantage of this distribution is that it is easily generated; one simply interacts with the model, following 
the current policy. In an episodic task, one starts in a start state (or according to the starting-state 
distribution) and simulates until the terminal state. In a continuing task, one starts anywhere and 
just keeps simulating. In either case, sample state transitions and rewards are given by the model, 
and sample actions are given by the current policy. In other words, one simulates explicit individual 
trajectories and performs updates at the state or state-action pairs encountered along the way. We call 
this way of generating experience and updates trajectory sampling. 
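
A minimal sketch of trajectory sampling in Python follows. It assumes a sample model that returns a next state, a reward, and a termination flag. The five-state corridor task, the one-step Q-learning update, and all names (`simulate_trajectory`, `corridor_model`, `epsilon_greedy`, `q_update`) are hypothetical choices made for illustration; any other update rule could be plugged in.

```python
import random


def simulate_trajectory(sample_model, policy, update, start_state, max_steps=10_000):
    """Follow `policy` through `sample_model`, updating every pair encountered."""
    state, visited = start_state, []
    for _ in range(max_steps):
        action = policy(state)                                 # action from the current policy
        next_state, reward, done = sample_model(state, action) # transition from the model
        update(state, action, reward, next_state, done)
        visited.append((state, action))
        if done:
            break
        state = next_state
    return visited


# Hypothetical task: a corridor of 5 states; action 1 moves right, action 0 moves
# left, every step costs -1, and reaching state 4 ends the episode.
N_STATES, ACTIONS = 5, (0, 1)
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}


def corridor_model(state, action):
    nxt = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    return nxt, -1.0, nxt == N_STATES - 1


def epsilon_greedy(state, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(state, a)] == best])


def q_update(state, action, reward, next_state, done, alpha=0.5):
    # One-step tabular Q-learning update on the simulated transition (undiscounted).
    target = reward if done else reward + max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (target - Q[(state, action)])


if __name__ == "__main__":
    random.seed(0)
    for _ in range(200):
        simulate_trajectory(corridor_model, epsilon_greedy, q_update, start_state=0)
    print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)})
```

Because updates are applied only where the simulated trajectory goes, states the current policy rarely reaches receive correspondingly little computation.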

It is hard to imagine any efficient way of distributing updates according to the on-policy distribution 
other than by trajectory sampling. If one had an explicit representation of the on-policy distribution, 
then one could sweep through all states, weighting the update of each according to the on-policy
distribution, but this leaves us again with all the computational costs of exhaustive sweeps. Possibly one
could sample and update individual state-action pairs from the distribution, but even if this could 
be done efficiently, what benefit would this provide over simulating trajectories? Even knowing the 
on-policy distribution in an explicit form is unlikely. The distribution changes whenever the policy 
changes, and computing the distribution requires computation comparable to a complete policy
evaluation. Consideration of such other possibilities makes trajectory sampling seem both efficient and
elegant. 

Is the on-policy distribution of updates a good one? Intuitively it seems like a good choice, at least 
better than the uniform distribution. For example, if you are learning to play chess, you study positions 
that might arise in real games, not random positions of chess pieces. The latter may be valid states, 
but to be able to accurately value them is a different skill from evaluating positions in real games. 
We will also see in Part II that the on-policy distribution has significant advantages when function 
approximation is used. Whether or not function approximation is used, one might expect on-policy 
focusing to significantly improve the speed of planning. 

Focusing on the on-policy distribution could be beneficial because it causes vast, uninteresting parts 
of the space to be ignored, or it could be detrimental because it causes the same old parts of the space to 
be updated over and over. We conducted a small experiment to assess the effect empirically. To isolate 
the effect of the update distribution, we used entirely one-step expected tabular updates, as defined 
by (8.1). In the uniform case, we cycled through all state-action pairs, updating each in place, and 
in the on-policy case we simulated episodes, all starting in the same state, updating each state-action 
pair that occurred under the current ε-greedy policy (ε = 0.1). The tasks were undiscounted episodic
tasks, generated randomly as follows. From each of the |S| states, two actions were possible, each of
which resulted in one of b next states, all equally likely, with a different random selection of b states for
each state-action pair. The branching factor, b, was the same for all state-action pairs. In addition,
on all transitions there was a 0.1 probability of transition to the terminal state, ending the episode.
We used episodic tasks to get a clear measure of the quality of the current policy. At any point in the
planning process one can stop and exhaustively compute $v_{\tilde\pi}(s_0)$, the true value of the start state under
the greedy policy, $\tilde\pi$, given the current action-value function Q, as an indication of how well the agent
would do on a new episode on which it acted greedily (all the while assuming the model is correct). 
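
The experimental setup can be sketched in a few dozen lines of Python. This is only an approximation of the experiment under stated assumptions, not the code that produced the reported results: the passage above does not say how rewards were generated, so the sketch assumes each transition's reward is drawn once from a standard normal distribution and that the terminating transition yields zero reward. Greedy ties are broken in favor of the first action, and the greedy policy is evaluated by iterative policy evaluation. All function names are hypothetical.

```python
import random

P_TERM = 0.1  # probability that any transition ends the episode


def make_task(n_states, b, rng):
    """Random task: 2 actions per state, b equally likely distinct successors per pair."""
    succ = [[rng.sample(range(n_states), b) for _ in range(2)] for _ in range(n_states)]
    # Assumption: one standard-normal reward per (state, action, successor).
    rew = [[[rng.gauss(0.0, 1.0) for _ in range(b)] for _ in range(2)] for _ in range(n_states)]
    return succ, rew


def expected_update(Q, task, s, a):
    """One-step expected tabular update of Q[s][a], in place."""
    succ, rew = task
    b = len(succ[s][a])
    backed_up = sum(r + max(Q[s2]) for s2, r in zip(succ[s][a], rew[s][a])) / b
    Q[s][a] = (1 - P_TERM) * backed_up  # terminating transition contributes 0 (assumed)


def plan_uniform(task, n_updates):
    """Cycle through all state-action pairs, updating each in place."""
    n = len(task[0])
    Q = [[0.0, 0.0] for _ in range(n)]
    for i in range(n_updates):
        s, a = divmod(i % (2 * n), 2)
        expected_update(Q, task, s, a)
    return Q


def plan_on_policy(task, n_updates, rng, epsilon=0.1):
    """Simulate epsilon-greedy episodes from state 0, updating each pair encountered."""
    succ, _ = task
    Q = [[0.0, 0.0] for _ in range(len(succ))]
    s = 0
    for _ in range(n_updates):
        greedy = 0 if Q[s][0] >= Q[s][1] else 1
        a = rng.randrange(2) if rng.random() < epsilon else greedy
        expected_update(Q, task, s, a)
        s = 0 if rng.random() < P_TERM else rng.choice(succ[s][a])
    return Q


def start_value_greedy(Q, task, sweeps=300):
    """Value of state 0 under the policy that is greedy with respect to Q."""
    succ, rew = task
    n = len(succ)
    greedy = [0 if Q[s][0] >= Q[s][1] else 1 for s in range(n)]
    v = [0.0] * n
    for _ in range(sweeps):  # in-place iterative policy evaluation
        for s in range(n):
            a = greedy[s]
            total = sum(r + v[s2] for s2, r in zip(succ[s][a], rew[s][a]))
            v[s] = (1 - P_TERM) * total / len(succ[s][a])
    return v[0]


if __name__ == "__main__":
    rng = random.Random(0)
    task = make_task(1000, 3, rng)
    for n_updates in (5_000, 20_000, 50_000):
        uniform = start_value_greedy(plan_uniform(task, n_updates), task)
        focused = start_value_greedy(plan_on_policy(task, n_updates, rng), task)
        print(f"{n_updates:>6} updates  uniform={uniform:.3f}  on-policy={focused:.3f}")
```

A single random task is noisy; reproducing curves like those described below would require averaging over many tasks.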

The upper part of Figure 8.9 shows results averaged over 200 sample tasks with 1000 states and 
branching factors of 1, 3, and 10. The quality of the policies found is plotted as a function of the 
number of expected updates completed. In all cases, sampling according to the on-policy distribution 
resulted in faster planning initially and retarded planning in the long run. The effect was stronger, and 
the initial period of faster planning was longer, at smaller branching factors. In other experiments, we 
found that these effects also became stronger as the number of states increased. For example, the lower 
part of Figure 8.9 shows results for a branching factor of 1 for tasks with 10,000 states. In this case the 
advantage of on-policy focusing is large and long-lasting. 

Figure 8.9: Relative efficiency of updates distributed uniformly across the state space versus focused on
simulated on-policy trajectories, each starting in the same state. Results are for randomly generated tasks of two
sizes and various branching factors, b.

All of these results make sense. In the short term, sampling according to the on-policy distribution
helps by focusing on states that are near descendants of the start state. If there are many states and
a small branching factor, this effect will be large and long-lasting. In the long run, focusing on the 
on-policy distribution may hurt because the commonly occurring states all already have their correct 
values. Sampling them is useless, whereas sampling other states may actually perform some useful 
work. This presumably is why the exhaustive, unfocused approach does better in the long run, at least 
for small problems. These results are not conclusive because they are only for problems generated in 
a particular, random way, but they do suggest that sampling according to the on-policy distribution 
can be a great advantage for large problems, in particular for problems in which a small subset of the 
state-action space is visited under the on-policy distribution. 


8.7 Real-time Dynamic Programming 

Real-time dynamic programming, or RTDP, is an on-policy trajectory-sampling version of DP’s value-iteration
algorithm. Because it is closely related to conventional sweep-based value iteration, RTDP
illustrates in a particularly clear way some of the advantages that on-policy trajectory sampling can 
provide. RTDP updates the values of states visited in actual or simulated trajectories by means of 
expected tabular value-iteration updates as defined by (4.10). It is basically the algorithm that produced 
the on-policy results shown in Figure 8.9. 

The close connection between RTDP and conventional DP makes it possible to derive some theoretical
results by adapting existing theory. RTDP is an example of an asynchronous DP algorithm as described 
in Section 4.5. Asynchronous DP algorithms are not organized in terms of systematic sweeps of the state 
set; they update state values in any order whatsoever, using whatever values of other states happen to 
be available. In RTDP, the update order is dictated by the order states are visited in real or simulated 
trajectories. 


If trajectories can start only from a designated set of start states, and if you are interested in 
the prediction problem for a given policy, then on-policy trajectory sampling allows the algorithm 
to completely skip states that cannot be reached by the given policy from any of the start states: 
unreachable states are irrelevant to the prediction problem. For a control problem, where the goal is 
to find an optimal policy instead of evaluating a given policy, there might well be states that cannot 
be reached by any optimal policy from any of the start states, and there is no need to specify optimal 
actions for these irrelevant states. What is needed is an optimal partial policy, meaning a policy that is 
optimal for the relevant states but can specify arbitrary actions, or even be undefined, for the irrelevant 
states (see the illustration below). 


[Illustration: the start states, the relevant states (reachable from some start state under some optimal
policy), and the irrelevant states (unreachable from any start state under any optimal policy).]

But finding such an optimal partial policy with an on-policy trajectory-sampling control method, such as
Sarsa (Section 6.4), in general requires visiting all state-action pairs, even those that will turn out
to be irrelevant, an infinite number of times. This can be done, for example, by using exploring starts
(Section 5.3). This is true for RTDP as well: for episodic tasks with exploring starts, RTDP is an
asynchronous value-iteration algorithm that converges to optimal policies for discounted finite MDPs (and
for the undiscounted case under certain conditions). Unlike the situation for a prediction problem, it is
generally not possible to stop updating any state or state-action pair if convergence to an optimal policy
is important.


The most interesting result for RTDP is that for certain types of problems satisfying reasonable 
conditions, RTDP is guaranteed to find a policy that is optimal on the relevant states without visiting 
every state infinitely often, or even without visiting some states at all. Indeed, in some problems, only 
a small fraction of the states need to be visited. This can be a great advantage for problems with very 
large state sets, where even a single sweep may not be feasible. 

The tasks for which this result holds are undiscounted episodic tasks for MDPs with absorbing goal 
states that generate zero rewards, as described in Section 3.4. At every step of a real or simulated 
trajectory, RTDP selects a greedy action (breaking ties randomly) and applies the expected value- 
iteration update operation to the current state. It can also update the values of an arbitrary collection 
of other states at each step; for example, it can update the values of states visited in a limited-horizon 
look-ahead search from the current state. 
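
The procedure just described can be sketched as follows. This is a minimal tabular version under stated assumptions, not the implementation used for any results in this chapter: it updates only the current state on each step, represents the model as a dictionary of (probability, next state, reward) triples, and initializes all values to zero. The function names and the small corridor task with a costly detour state are hypothetical.

```python
import random


def rtdp(transitions, start_states, goal_states, n_episodes, rng, max_steps=10_000):
    """Tabular RTDP for an undiscounted episodic task.

    transitions[s][a] is a list of (probability, next_state, reward) triples.
    Returns the value table and the number of updates applied to each state.
    """
    V = {s: 0.0 for s in transitions}        # zero initial values
    V.update({g: 0.0 for g in goal_states})  # goal states have value zero
    updates = {s: 0 for s in V}

    def backup(s, a):
        return sum(p * (r + V[s2]) for p, s2, r in transitions[s][a])

    for _ in range(n_episodes):
        s = rng.choice(start_states)
        for _ in range(max_steps):
            if s in goal_states:
                break
            q = {a: backup(s, a) for a in transitions[s]}
            best = max(q.values())
            V[s] = best                      # expected value-iteration update
            updates[s] += 1
            a = rng.choice([a for a in q if q[a] == best])  # greedy, ties broken randomly
            outcomes = transitions[s][a]
            s = rng.choices([s2 for _, s2, _ in outcomes],
                            weights=[p for p, _, _ in outcomes])[0]
    return V, updates


def make_example():
    """Hypothetical corridor 0 -> 1 -> 2 -> 3 (goal) with a costly detour state 'd'."""
    T = {}
    for s in (0, 1, 2):
        T[s] = {
            "fwd": [(0.8, s + 1, -1.0), (0.2, s, -1.0)],  # slips in place with prob. 0.2
            "detour": [(1.0, "d", -1.0)],
        }
    T["d"] = {"back": [(1.0, 0, -1.0)]}
    return T, [0], {3}


if __name__ == "__main__":
    V, updates = rtdp(*make_example(), n_episodes=500, rng=random.Random(0))
    print(V)
    print(updates)
```

In this toy task the optimal policy always moves forward, so the detour state is irrelevant in the sense used above; printing `updates` shows how the computation is distributed across states.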


For these problems, with each episode beginning in a state randomly chosen from the set of start
states, and ending at a goal state, RTDP converges (with probability one) to a policy that is optimal
for all the relevant states² provided the following conditions are satisfied: 1) the initial value of every
goal state is zero, 2) there exists at least one policy that guarantees that a goal state will be reached
with probability one from any start state, 3) all rewards for transitions from non-goal states are strictly
negative, and 4) all the initial values are equal to, or greater than, their optimal values (which can be
satisfied by simply setting the initial values of all states to zero). This result was proved by Barto,
Bradtke, and Singh (1995) by combining results for asynchronous DP with results about a heuristic
search algorithm known as learning real-time A* due to Korf (1990).

² This policy might be stochastic because RTDP continues to randomly select among all the greedy actions.

Tasks having these properties are examples of stochastic optimal path problems, which are usually
stated in terms of cost minimization instead of as reward maximization, as we do here. Maximizing the
negative returns in our version is equivalent to minimizing the costs of paths from a start state to a goal
state. Examples of this kind of task are minimum-time control tasks, where each time step required to
reach a goal produces a reward of −1, or problems like the Golf example in Section 3.5, whose objective
is to hit the hole with the fewest strokes. 

Example 8.6: RTDP on the Racetrack The racetrack problem of Exercise 5.7 (page 91) is a 
stochastic optimal path problem. Comparing RTDP and the conventional DP value iteration algorithm 
on an example racetrack problem illustrates some of the advantages of on-policy trajectory sampling. 

Recall from the exercise that an agent has to learn how to drive a car around a turn like those shown 
in Figure 5.5 and cross the finish line as quickly as possible while staying on the track. Start states 
are all the zero-speed states on the starting line; the goal states are all the states that can be reached 
in one time step by crossing the finish line from inside the track. Unlike Exercise 5.7, here there is no 
limit on the car’s speed, so the state set is potentially infinite. However, the set of states that can be 
reached from the set of start states via any policy is finite and can be considered to be the state set 
of the problem. Each episode begins in a randomly selected start state and ends when the car crosses 
the finish line. The rewards are −1 for each step until the car crosses the finish line. If the car hits the
track boundary, it is moved back to a random start state, and the episode continues. 

A racetrack similar to the small racetrack on the left of Figure 5.5 has 9,115 states reachable from 
start states by any policy, only 599 of which are relevant, meaning that they are reachable from some 
start state via some optimal policy. (The number of relevant states was estimated by counting the 
states visited while executing optimal actions for $10^7$ episodes.)

The table below compares solving this task by conventional DP and by RTDP. These results are 
averages over 25 runs, each begun with a different random number seed. Conventional DP in this case 
is value iteration using exhaustive sweeps of the state set, with values updated one state at a time in 
place, meaning that the update for each state uses the most recent values of the other states. (This is
the Gauss-Seidel version of value iteration, which was found to be approximately twice as fast as the 
Jacobi version on this problem. See Section 4.8.) No special attention was paid to the ordering of the 
updates; other orderings could have produced faster convergence. Initial values were all zero for each 
run of both methods. DP was judged to have converged when the maximum change in a state value 
over a sweep was less than $10^{-4}$, and RTDP was judged to have converged when the average time to
cross the finish line over 20 episodes appeared to stabilize at an asymptotic number of steps. This 
version of RTDP updated only the value of the current state on each step. 



                                              DP           RTDP
Average computation to convergence            28 sweeps    4000 episodes
Average number of updates to convergence      252,784      127,600
Average number of updates per episode         -            31.9
% of states updated ≤ 100 times               -            98.45
% of states updated ≤ 10 times                -            80.51
% of states updated 0 times                   -            3.18


Both methods produced policies averaging between 14 and 15 steps to cross the finish line, but RTDP 
required only roughly half of the updates that DP did. This is the result of RTDP’s on-policy trajectory 
sampling. Whereas the value of every state was updated in each sweep of DP, RTDP focused updates 
on fewer states. In an average run, RTDP updated the values of 98.45% of the states no more than 100 
times and 80.51% of the states no more than 10 times; the values of about 290 states were not updated 
at all in an average run. ■ 





Another advantage of RTDP is that as the value function approaches the optimal value function $v_*$,
the policy used by the agent to generate trajectories approaches an optimal policy because it is always
greedy with respect to the current value function. This is in contrast to the situation in conventional
value iteration. In practice, value iteration terminates when the value function changes by only a small
amount in a sweep, which is how we terminated it to obtain the results in the table above. At this point,
the value function closely approximates $v_*$, and a greedy policy is close to an optimal policy. However,
it is possible that policies that are greedy with respect to the latest value function were optimal, or
nearly so, long before value iteration terminates. (Recall from Chapter 4 that optimal policies can be
greedy with respect to many different value functions, not just $v_*$.) Checking for the emergence of
an optimal policy before value iteration converges is not a part of the conventional DP algorithm and 
requires considerable additional computation. 

In the racetrack example, by running many test episodes after each DP sweep, with actions selected 
greedily according to the result of that sweep, it was possible to estimate the earliest point in the 
DP computation at which the approximated optimal evaluation function was good enough so that the 
corresponding greedy policy was nearly optimal. For this racetrack, a close-to-optimal policy emerged 
after 15 sweeps of value iteration, or after 136,725 value-iteration updates. This is considerably less 
than the 252,784 updates DP needed to converge to $v_*$, but still more than the 127,600 updates RTDP
required. 

Although these simulations are certainly not definitive comparisons of RTDP with conventional
sweep-based value iteration, they illustrate some of the advantages of on-policy trajectory sampling.
Whereas conventional value iteration continued to update the value of all the states, RTDP strongly
focused on subsets of the states that were relevant to the problem’s objective. This focus became
increasingly narrow as learning continued. Because the convergence theorem for RTDP applies to the
simulations, we know that RTDP eventually would have focused only on relevant states, i.e., on states 
making up optimal paths. RTDP achieved nearly optimal control with about 50% of the computation 
required by sweep-based value iteration. 


8.8 Planning at Decision Time 

Planning can be used in at least two ways. The one we have considered so far in this chapter, typified 
by dynamic programming and Dyna, is to use planning to gradually improve a policy or value function 
on the basis of simulated experience obtained from a model (either a sample or a distribution model). 
Selecting actions is then a matter of comparing the current state’s action values obtained from a table 
in the tabular case we have thus far considered, or by evaluating a mathematical expression in the 
approximate methods we consider in Part II below. Well before an action is selected for any current 
state $S_t$, planning has played a part in improving the table entries, or the mathematical expression,
needed to select the action for many states, including $S_t$. Used this way, planning is not focused on
the current state. We call planning used in this way background planning. 

The other way to use planning is to begin and complete it after encountering each new state $S_t$, as
a computation whose output is the selection of a single action $A_t$; on the next step planning begins
anew with $S_{t+1}$ to produce $A_{t+1}$, and so on. The simplest, and almost degenerate, example of this
use of planning is when only state values are available, and an action is selected by comparing the 
values of model-predicted next states for each action (or by comparing the values of afterstates as in 
the tic-tac-toe example in Chapter 1). More generally, planning used in this way can look much deeper 
than one-step-ahead and evaluate action choices leading to many different predicted state and reward 
trajectories. Unlike the first use of planning, here planning focuses on a particular state. We call this 
decision-time planning. 
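
The simplest case mentioned above, selecting an action by comparing the values of model-predicted next states, might look like the following sketch. It assumes a distribution model that returns (probability, next state, reward) triples and a table of state values; the names `select_action`, `model`, and `V`, and the two-action example, are hypothetical.

```python
def select_action(state, actions, model, V, gamma=1.0):
    """Pick the action whose model-predicted next states have the highest backed-up value.

    model(state, action) returns a list of (probability, next_state, reward) triples.
    Nothing is stored: the backed-up values exist only for this one decision.
    """
    def backed_up_value(action):
        return sum(p * (r + gamma * V[s2]) for p, s2, r in model(state, action))

    return max(actions, key=backed_up_value)


# Hypothetical example: a sure outcome versus a gamble.
V = {"safe": 1.0, "risky_win": 5.0, "risky_loss": -10.0}


def model(state, action):
    if action == "safe":
        return [(1.0, "safe", 0.0)]
    return [(0.5, "risky_win", 0.0), (0.5, "risky_loss", 0.0)]


if __name__ == "__main__":
    # Backed-up values: safe = 1.0, risky = 0.5 * 5.0 + 0.5 * (-10.0) = -2.5
    print(select_action("s0", ["safe", "risky"], model, V))  # prints "safe"
```

Calling `select_action` again at the next state repeats the whole computation from scratch, which is what distinguishes this use of planning from background planning.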

These two ways of thinking about planning—using simulated experience to gradually improve a policy 
or value function, or using simulated experience to select an action for the current state—can blend 
together in natural and interesting ways, but they have tended to be studied separately, and that is a
good way to first understand them. Let us now take a closer look at decision-time planning. 

Even when planning is only done at decision time, we can still view it, as we did in Section 8.1, as 
proceeding from simulated experience to updates and values, and ultimately to a policy. It is just that 
now the values and policy are specific to the current state and the action choices available there, so 
much so that the values and policy created by the planning process are typically discarded after being 
used to select the current action. In many applications this is not a great loss because there are very 
many states and we are unlikely to return to the same state for a long time. In general, one may want 
to do a mix of both: focus planning on the current state and store the results of planning so as to 
be that much farther along should one return to the same state later. Decision-time planning is most 
useful in applications in which fast responses are not required. In chess playing programs, for example, 
one may be permitted seconds or minutes of computation for each move, and strong programs may 
plan dozens of moves ahead within this time. On the other hand, if low latency action selection is the 
priority, then one is generally better off doing planning in the background to compute a policy that can 
then be rapidly applied to each newly encountered state. 


8.9 Heuristic Search 

The classical state-space planning methods in artificial intelligence are decision-time planning methods 
collectively known as heuristic search. In heuristic search, for each state encountered, a large tree of 
possible continuations is considered. The approximate value function is applied to the leaf nodes and 
then backed up toward the current state at the root. The backing up within the search tree is just the 
same as in the expected updates with maxes (those for $v_*$ and $q_*$) discussed throughout this book. The
backing up stops at the state-action nodes for the current state. Once the backed-up values of these 
nodes are computed, the best of them is chosen as the current action, and then all backed-up values 
are discarded. 

In conventional heuristic search no effort is made to save the backed-up values by changing the 
approximate value function. In fact, the value function is generally designed by people and never 
changed as a result of search. However, it is natural to consider allowing the value function to be 
improved over time, using either the backed-up values computed during heuristic search or any of the 
other methods presented throughout this book. In a sense we have taken this approach all along. Our 
greedy, ε-greedy, and UCB (Section 2.7) action-selection methods are not unlike heuristic search, albeit
on a smaller scale. For example, to compute the greedy action given a model and a state-value function, 
we must look ahead from each possible action to each possible next state, take into account the rewards 
and estimated values, and then pick the best action. Just as in conventional heuristic search, this 
process computes backed-up values of the possible actions, but does not attempt to save them. Thus, 
heuristic search can be viewed as an extension of the idea of a greedy policy beyond a single step. 
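
To illustrate this view, here is a minimal full-width, depth-limited search in Python. It is a sketch under simplifying assumptions (a small distribution model, no pruning or selective expansion, terminal states marked by having no actions), and every name in it is hypothetical. With depth 1 it reduces to the greedy computation described above; larger depths extend the same backup over more steps.

```python
def action_value(state, action, depth, actions, model, v_hat, gamma=1.0):
    """Backed-up value of taking `action` in `state`, searching `depth - 1` steps further."""
    return sum(
        p * (r + gamma * state_value(s2, depth - 1, actions, model, v_hat, gamma))
        for p, s2, r in model(state, action)
    )


def state_value(state, depth, actions, model, v_hat, gamma=1.0):
    available = actions(state)
    if not available:   # terminal state: no further reward
        return 0.0
    if depth == 0:      # leaf node: apply the approximate value function
        return v_hat(state)
    return max(action_value(state, a, depth, actions, model, v_hat, gamma)
               for a in available)


def heuristic_search_action(state, depth, actions, model, v_hat, gamma=1.0):
    """Choose the root action with the best backed-up value; nothing is saved afterwards."""
    return max(actions(state),
               key=lambda a: action_value(state, a, depth, actions, model, v_hat, gamma))


# Hypothetical task in which the approximate value function is misleading.
GRAPH = {
    "root": {"L": [(1.0, "left", 0.0)], "R": [(1.0, "right", 0.0)]},
    "left": {"go": [(1.0, "end", 1.0)]},
    "right": {"go": [(1.0, "end", 10.0)]},
    "end": {},
}
V_HAT = {"root": 0.0, "left": 5.0, "right": 0.0, "end": 0.0}


def actions(state):
    return list(GRAPH[state])


def model(state, action):
    return GRAPH[state][action]


def v_hat(state):
    return V_HAT[state]


if __name__ == "__main__":
    print(heuristic_search_action("root", 1, actions, model, v_hat))  # prints "L"
    print(heuristic_search_action("root", 2, actions, model, v_hat))  # prints "R"
```

In this example the one-step search trusts the inaccurate estimate for the left branch, whereas the two-step search reaches the end of the episode and so no longer depends on the estimates at all.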

The point of searching deeper than one step is to obtain better action selections. If one has a
perfect model and an imperfect action-value function, then in fact deeper search will usually yield
better policies.³ Certainly, if the search is all the way to the end of the episode, then the effect of the
imperfect value function is eliminated, and the action determined in this way must be optimal. If the
search is of sufficient depth k such that $\gamma^k$ is very small, then the actions will be correspondingly near
optimal. On the other hand, the deeper the search, the more computation is required, usually resulting
in a slower response time. A good example is provided by Tesauro’s grandmaster-level backgammon
player, TD-Gammon (Section 16.1). This system used TD learning to learn an afterstate value function
through many games of self-play, using a form of heuristic search to make its moves. As a model,
TD-Gammon used a priori knowledge of the probabilities of dice rolls and the assumption that the opponent
always selected the actions that TD-Gammon rated as best for it. Tesauro found that the deeper the
heuristic search, the better the moves made by TD-Gammon, but the longer it took to make each move.
Backgammon has a large branching factor, yet moves must be made within a few seconds. It was only
feasible to search ahead selectively a few steps, but even so the search resulted in significantly better
action selections.

³ There are interesting exceptions to this. See, e.g., Pearl (1984).

We should not overlook the most obvious way in which heuristic search focuses updates: on the 
current state. Much of the effectiveness of heuristic search is due to its search tree being tightly focused 
on the states and actions that might immediately follow the current state. You may spend more of your 
life playing chess than checkers, but when you play checkers, it pays to think about checkers and about 
your particular checkers position, your likely next moves, and successor positions. No matter how you 
select actions, it is these states and actions that are of highest priority for updates and where you most 
urgently want your approximate value function to be accurate. Not only should your computation be 
preferentially devoted to imminent events, but so should your limited memory resources. In chess, for 
example, there are far too many possible positions to store distinct value estimates for each of them, but 
chess programs based on heuristic search can easily store distinct estimates for the millions of positions 
they encounter looking ahead from a single position. This great focusing of memory and computational 
resources on the current decision is presumably the reason why heuristic search can be so effective. 

The distribution of updates can be altered in similar ways to focus on the current state and its 
likely successors. As a limiting case we might use exactly the methods of heuristic search to construct 
a search tree, and then perform the individual, one-step updates from bottom up, as suggested by 
Figure 8.10. If the updates are ordered in this way and a tabular representation is used, then exactly 
the same overall update would be achieved as in depth-first heuristic search. Any state-space search can 
be viewed in this way as the piecing together of a large number of individual one-step updates. Thus, 
the performance improvement observed with deeper searches is not due to the use of multistep updates 
as such. Instead, it is due to the focus and concentration of updates on states and actions immediately 
downstream from the current state. By devoting a large amount of computation specifically relevant 
to the candidate actions, decision-time planning can produce better decisions than can be produced by 
relying on unfocused updates. 



Figure 8.10: Heuristic search can be implemented as a sequence of one-step updates (shown here outlined) 
backing up values from the leaf nodes toward the root. The ordering shown here is for a selective depth-first 
search. 


8.10 Rollout Algorithms


Rollout algorithms are decision-time planning algorithms based on Monte Carlo control applied to 
simulated trajectories that all begin at the current environment state. They estimate action values for 
a given policy by averaging the returns of many simulated trajectories that start with each possible 
action and then follow the given policy. When the action-value estimates are considered to be accurate 
enough, the action (or one of the actions) having the highest estimated value is executed, after which 
the process is carried out anew from the resulting next state. As explained by Tesauro and Galperin 
(1997), who experimented with rollout algorithms for playing backgammon, the term “rollout” comes 
from estimating the value of a backgammon position by playing out, i.e., “rolling out,” the position 
many times to the game’s end with randomly generated sequences of dice rolls, where the moves of both 
players are made by some fixed policy. 

Unlike the Monte Carlo control algorithms described in Chapter 5, the goal of a rollout algorithm is 
not to estimate a complete optimal action-value function, $q_*$, or a complete action-value function, $q_\pi$,
for a given policy π. Instead, they produce Monte Carlo estimates of action values only for each current
state and for a given policy usually called the rollout policy. As decision-time planning algorithms, 
rollout algorithms make immediate use of these action-value estimates, then discard them. This makes 
rollout algorithms relatively simple to implement because there is no need to sample outcomes for every 
state-action pair, and there is no need to approximate a function over either the state space or the 
state-action space. 

What then do rollout algorithms accomplish? The policy improvement theorem described in Section 4.2
tells us that given any two policies π and π' that are identical except that π'(s) = a ≠ π(s)
for some state s, if $q_\pi(s, a) \geq v_\pi(s)$, then policy π' is as good as, or better than, π. Moreover, if
the inequality is strict, then π' is in fact better than π. This applies to rollout algorithms where s
is the current state and π is the rollout policy. Averaging the returns of the simulated trajectories
produces estimates of $q_\pi(s, a')$ for each action $a' \in \mathcal{A}(s)$. Then the policy that selects an action in s
that maximizes these estimates and thereafter follows π is a good candidate for a policy that improves
over π. The result is like one step of the policy-iteration algorithm of dynamic programming discussed
in Section 4.3 (though it is more like one step of asynchronous value iteration described in Section 4.5 
because it changes the action for just the current state). 
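
A minimal sketch of a rollout algorithm follows. It assumes a sample model returning a next state, a reward, and a termination flag, undiscounted returns, and a fixed number of simulated trajectories per action rather than an adaptive stopping rule. The function names and the small random-walk task are hypothetical and serve only to show the structure.

```python
import random


def rollout_action(state, actions, sample_model, rollout_policy, n_trajectories, rng,
                   max_steps=1_000):
    """Return (best_action, estimates) from Monte Carlo rollouts that start at `state`."""
    def simulated_return(first_action):
        s, a, total = state, first_action, 0.0
        for _ in range(max_steps):
            s, reward, done = sample_model(s, a, rng)
            total += reward              # undiscounted return
            if done:
                break
            a = rollout_policy(s, rng)   # after the first action, follow the rollout policy
        return total

    # Average the returns of the trajectories that begin with each candidate action.
    estimates = {
        a: sum(simulated_return(a) for _ in range(n_trajectories)) / n_trajectories
        for a in actions(state)
    }
    return max(estimates, key=estimates.get), estimates


# Hypothetical task: positions 0..GOAL on a line, reward -1 per step, episode ends at GOAL.
GOAL = 6


def walk_actions(state):
    return (-1, +1)


def walk_model(state, action, rng):
    nxt = min(max(state + action, 0), GOAL)
    return nxt, -1.0, nxt == GOAL


def random_policy(state, rng):
    return rng.choice((-1, +1))


if __name__ == "__main__":
    best, estimates = rollout_action(3, walk_actions, walk_model, random_policy,
                                     n_trajectories=500, rng=random.Random(0))
    print(best, estimates)
```

Even with a completely random rollout policy, the estimates separate the two first actions here because stepping toward the goal shortens the expected remainder of the walk. The estimates are discarded once the action is chosen, and the procedure is repeated from the next state.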

In other words, the aim of a rollout algorithm is to improve upon the rollout policy, not to find
an optimal policy. Experience has shown that rollout algorithms can be surprisingly effective. For 
example, Tesauro and Galperin (1997) were surprised by the dramatic improvements in backgammon 
playing ability produced by the rollout method. In some applications, a rollout algorithm can produce 
good performance even if the rollout policy is completely random. But clearly, the performance of the 
improved policy depends on the performance of the rollout policy and the accuracy of the Monte Carlo 
value estimates: the better the rollout policy and the more accurate the value estimates, the better the 
policy produced by a rollout algorithm is likely to be. 

This involves important tradeoffs because better rollout policies typically mean that more time is 
needed to simulate enough trajectories to obtain good value estimates. As decision-time planning 
methods, rollout algorithms usually have to meet strict time constraints. The computation time needed 
by a rollout algorithm depends on the number of actions that have to be evaluated for each decision, 
the number of time steps in the simulated trajectories needed to obtain useful sample returns, the time 
it takes the rollout policy to make decisions, and the number of simulated trajectories needed to obtain 
good Monte Carlo action-value estimates. 
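
As a rough illustration of how these factors combine, one might model the cost of a single decision as a product. The symbols $n$, $H$, and $t_{\text{step}}$ and the numbers in the example are assumptions made here for illustration only.

```latex
% Rough cost model (an assumption, not a result from the text):
%   n        = simulated trajectories per candidate action
%   H        = time steps per simulated trajectory
%   t_step   = time per simulated step (model sample plus rollout-policy decision)
T_{\text{decision}} \approx |\mathcal{A}(s)| \cdot n \cdot H \cdot t_{\text{step}}
% Example: 20 actions, 100 trajectories each, 50 steps each, 10 microseconds per step:
%   20 \cdot 100 \cdot 50 \cdot 10\,\mu\text{s} = 10^{6}\,\mu\text{s} = 1\,\text{s}
```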

Balancing these factors is important in any application of rollout methods, though there are several 
ways to ease the challenge. Because the Monte Carlo trials are independent of one another, it is
possible to run many trials in parallel on separate processors. Another tack is to truncate the simulated
trajectories short of complete episodes, correcting the truncated returns by means of a stored evaluation
function (which brings into play all that we have said about truncated returns and updates in 
the preceding chapters). It is also possible, as Tesauro and Galperin (1997) suggest, to monitor the 
Monte Carlo simulations and prune away candidate actions that are unlikely to turn out to be the 
best, or whose values are close enough to that of the current best that choosing them instead would 
make no real difference (though Tesauro and Galperin point out that this would complicate a parallel 
implementation). 
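
The truncation idea can be made concrete with a short sketch. It assumes the simulated trajectory was cut off after a fixed number of steps and that `tail_value` is the stored evaluation function's estimate for the state where the simulation stopped; both names are introduced here for illustration.

```python
def truncated_return(rewards, gamma, tail_value):
    """Discounted sum of the simulated rewards, corrected by an estimate of what
    would have followed the truncation point (tail_value is assumed to come from
    a stored evaluation function)."""
    g = tail_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Two simulated rewards of 1, discount 0.5, and a stored estimate of 4 for the
# state at which the simulation was cut off.
corrected = truncated_return([1.0, 1.0], 0.5, 4.0)
uncorrected = truncated_return([1.0, 1.0], 0.5, 0.0)
```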

We do not ordinarily think of rollout algorithms as learning algorithms because they do not maintain 
long-term memories of values or policies. However, these algorithms take advantage of some of the 
features of reinforcement learning that we have emphasized in this book. As instances of Monte Carlo 
control, they estimate action values by averaging the returns of a collection of sample trajectories, in 
this case trajectories of simulated interactions with a sample model of the environment. In this way they 
are like reinforcement learning algorithms in avoiding the exhaustive sweeps of dynamic programming 
by trajectory sampling, and in avoiding the need for distribution models by relying on sample, instead 
of expected, updates. Finally, rollout algorithms take advantage of the policy improvement property 
by acting greedily with respect to the estimated action values. 


8.11 Monte Carlo Tree Search 

Monte Carlo Tree Search (MCTS) is a recent and strikingly successful example of decision-time planning. 
At its base, MCTS is a rollout algorithm as described above, but enhanced by the addition of a means 
for accumulating value estimates obtained from the Monte Carlo simulations in order to successively 
direct simulations toward more highly-rewarding trajectories. MCTS is largely responsible for the 
improvement in computer Go from a weak amateur level in 2005 to a grandmaster level (6 dan or more) 
in 2015. Many variations of the basic algorithm have been developed, including a variant that we discuss 
in Section 16.6 that was critical for the stunning 2016 victories of the program AlphaGo over an 18-time 
world champion Go player. MCTS has proved to be effective in a wide variety of competitive settings, 
including general game playing (e.g., see Finnsson and Björnsson, 2008; Genesereth and Thielscher, 
2014), but it is not limited to games; it can be effective for single-agent sequential decision problems if 
there is an environment model simple enough for fast multistep simulation. 

MCTS is executed after encountering each new state to select the agent’s action for that state; it 
is executed again to select the action for the next state, and so on. As in a rollout algorithm, each 
execution is an iterative process that simulates many trajectories starting from the current state and 
running to a terminal state (or until discounting makes any further reward negligible as a contribution 
to the return). The core idea of MCTS is to successively focus multiple simulations starting at the 
current state by extending the initial portions of trajectories that have received high evaluations from 
earlier simulations. MCTS does not have to retain approximate value functions or policies from one 
action selection to the next, though in many implementations it retains selected action values likely to 
be useful for its next execution. 

For the most part, the actions in the simulated trajectories are generated using a simple policy, 
usually called a rollout policy as it is for simpler rollout algorithms. When both the rollout policy and 
the model do not require a lot of computation, many simulated trajectories can be generated in a short 
period of time. As in any tabular Monte Carlo method, the value of a state-action pair is estimated 
as the average of the (simulated) returns from that pair. Monte Carlo value estimates are maintained 
only for the subset of state-action pairs that are most likely to be reached in a few steps, which form a 
tree rooted at the current state, as illustrated in Figure 8.11. MCTS incrementally extends the tree by 
adding nodes representing states that look promising based on the results of the simulated trajectories. 
Any simulated trajectory will pass through the tree and then exit it at some leaf node. Outside the tree 
and at the leaf nodes the rollout policy is used for action selections, but at the states inside the tree 
something better is possible. For these states we have value estimates for at least some of the actions,
so we can pick among them using an informed policy, called the tree policy, that balances exploration
and exploitation. For example, the tree policy could select actions using an $\varepsilon$-greedy or UCB
selection rule (Chapter 2). 
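
As a minimal sketch of what such a tree policy might look like, the two functions below choose among the actions at a tree node given per-action statistics. The data layout (a dictionary mapping each action to a `(visit_count, mean_value)` pair), the function names, and the constant `c` are assumptions made for this illustration.

```python
import math
import random

def epsilon_greedy_select(stats, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the highest-valued one."""
    if rng.random() < epsilon:
        return rng.choice(list(stats))
    return max(stats, key=lambda a: stats[a][1])

def ucb_select(stats, c=1.4):
    """Pick the action maximizing mean value plus an exploration bonus.
    Unvisited actions are treated as maximizing (infinite bonus)."""
    total = sum(n for n, _ in stats.values())

    def score(a):
        n, q = stats[a]
        if n == 0:
            return math.inf
        return q + c * math.sqrt(math.log(total) / n)

    return max(stats, key=score)

# Action "b" has a slightly lower mean but has barely been tried,
# so the UCB rule explores it while the greedy rule does not.
stats = {"a": (1000, 0.55), "b": (2, 0.50)}
greedy_pick = epsilon_greedy_select(stats, epsilon=0.0)
ucb_pick = ucb_select(stats)
```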

In more detail, each iteration of a basic version of MCTS consists of the following four steps as 
illustrated in Figure 8.11: 

1. Selection. Starting at the root node, a tree policy based on the action values attached to the 
edges of the tree traverses the tree to select a leaf node. 

2. Expansion. On some iterations (depending on details of the application), the tree is expanded 
from the selected leaf node by adding one or more child nodes reached from the selected node via 
unexplored actions. 

3. Simulation. From the selected node, or from one of its newly-added child nodes (if any),
simulation of a complete episode is run with actions selected by the rollout policy. The result is a 
Monte Carlo trial with actions selected first by the tree policy and beyond the tree by the rollout 
policy. 

4. Backup. The return generated by the simulated episode is backed up to update, or to initialize, 
the action values attached to the edges of the tree traversed by the tree policy in this iteration 
of MCTS. No values are saved for the states and actions visited by the rollout policy beyond the 
tree. Figure 8.11 illustrates this by showing a backup from the terminal state of the simulated 
trajectory directly to the state-action node in the tree where the rollout policy began (though in 
general, the entire return over the simulated trajectory is backed up to this state-action node). 
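
The four steps above can be sketched compactly in Python. This is one hedged way to arrange them, not the canonical algorithm: it assumes hashable states that do not repeat within an episode, episodes that always terminate, a sample model returning `(next_state, reward, done)`, a UCB tree policy, expansion of one node on every iteration, and a final choice by visit count. All names and the toy problem are introduced here for illustration.

```python
import math
import random

def ucb_choice(edges, c, rng):
    """Tree policy: try unvisited actions first, otherwise maximize a UCB score."""
    unvisited = [a for a, (n, _) in edges.items() if n == 0]
    if unvisited:
        return rng.choice(unvisited)
    total = sum(n for n, _ in edges.values())
    return max(edges, key=lambda a: edges[a][1]
               + c * math.sqrt(math.log(total) / edges[a][0]))

def mcts(root, actions, sample_model, rollout_policy,
         n_iterations=500, gamma=1.0, c=1.4, rng=random):
    new_node = lambda s: {a: (0, 0.0) for a in actions(s)}  # action -> (visits, mean return)
    tree = {root: new_node(root)}
    for _ in range(n_iterations):
        path, s, done = [], root, False
        # 1. Selection: follow the tree policy until leaving the tree or terminating.
        while not done and s in tree:
            a = ucb_choice(tree[s], c, rng)
            s_next, r, done = sample_model(s, a, rng)
            path.append((s, a, r))
            s = s_next
        # 2. Expansion: add the first state reached outside the tree.
        if not done:
            tree[s] = new_node(s)
        # 3. Simulation: the rollout policy finishes the episode.
        g, discount = 0.0, 1.0
        while not done:
            s, r, done = sample_model(s, rollout_policy(s, rng), rng)
            g += discount * r
            discount *= gamma
        # 4. Backup: update only the edges traversed by the tree policy.
        for s_t, a_t, r_t in reversed(path):
            g = r_t + gamma * g
            n, q = tree[s_t][a_t]
            tree[s_t][a_t] = (n + 1, q + (g - q) / (n + 1))
    best = max(tree[root], key=lambda a: tree[root][a][0])  # largest visit count
    return best, tree

# Hypothetical toy problem: choose three bits; the terminal reward is the
# fraction of bits that are 1. A state is the tuple of bits chosen so far.
def bit_actions(s):
    return [0, 1]

def bit_model(s, a, rng):
    s_next = s + (a,)
    done = len(s_next) == 3
    return s_next, (sum(s_next) / 3 if done else 0.0), done

def random_bits(s, rng):
    return rng.choice([0, 1])

best, tree = mcts((), bit_actions, bit_model, random_bits, rng=random.Random(0))
```

Keying the tree by state keeps the sketch short; an implementation for a problem in which states can recur within an episode would need explicit nodes or a depth limit instead.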




Figure 8.11: Monte Carlo Tree Search. When the environment changes to a new state, MCTS executes as 
many iterations as possible before an action needs to be selected, incrementally building a tree whose root node 
represents the current state. Each iteration consists of the four operations Selection, Expansion (though 
possibly skipped on some iterations), Simulation, and Backup, as explained in the text and illustrated by the 
bold arrows in the trees. Adapted from Chaslot, Bakkes, Szita, and Spronck (2008). 

MCTS continues executing these four steps, starting each time at the tree’s root node, until no more 
time is left, or some other computational resource is exhausted. Then, finally, an action from the 
root node (which still represents the current state of the environment) is selected according to some 
mechanism that depends on the accumulated statistics in the tree; for example, it may be an action 
having the largest action value of all the actions available from the root state, or perhaps the action 
with the largest visit count to avoid selecting outliers. This is the action MCTS actually selects. After 
the environment transitions to a new state, MCTS is run again, sometimes starting with a tree of a 
single root node representing the new state, but often starting with a tree containing any descendants 
of this node left over from the tree constructed by the previous execution of MCTS; all the remaining 
nodes are discarded, along with the action values associated with them. 

MCTS was first proposed to select moves in programs playing two-person competitive games, such as 
Go. For game playing, each simulated episode is one complete play of the game in which both players 
select actions by the tree and rollout policies. Section 16.6 describes an extension of MCTS used in the 
AlphaGo program that combines the Monte Carlo evaluations of MCTS with action values learned by 
a deep ANN via self-play reinforcement learning. 

Relating MCTS to the reinforcement learning principles we describe in this book provides some 
insight into how it achieves such impressive results. At its base, MCTS is a decision-time planning 
algorithm based on Monte Carlo control applied to simulations that start from the root state; that is, 
it is a kind of rollout algorithm as described in the previous section. It therefore benefits from online, 
incremental, sample-based value estimation and policy improvement. Beyond this, it saves action-value 
estimates attached to the tree edges and updates them using reinforcement learning’s sample updates. 
This has the effect of focusing the Monte Carlo trials on trajectories whose initial segments are common 
to high-return trajectories previously simulated. Further, by incrementally expanding the tree, MCTS 
effectively grows a lookup table to store a partial action-value function, with memory allocated to the 
estimated values of state-action pairs visited in the initial segments of high-yielding sample trajectories. 
MCTS thus avoids the problem of globally approximating an action-value function while it retains the 
benefit of using past experience to guide exploration. 

The striking success of decision-time planning by MCTS has deeply influenced artificial intelligence, 
and many researchers are studying modifications and extensions of the basic procedure for use in both 
games and single-agent applications. 


8.12 Summary of the Chapter 

Planning requires a model of the environment. A distribution model consists of the probabilities of 
next states and rewards for possible actions; a sample model produces single transitions and rewards 
generated according to these probabilities. Dynamic programming requires a distribution model because 
it uses expected updates, which involve computing expectations over all the possible next states and 
rewards. A sample model, on the other hand, is what is needed to simulate interacting with the 
environment during which sample updates, like those used by many reinforcement learning algorithms, 
can be used. Sample models are generally much easier to obtain than distribution models. 

We have presented a perspective emphasizing the surprisingly close relationships between planning 
optimal behavior and learning optimal behavior. Both involve estimating the same value functions, 
and in both cases it is natural to update the estimates incrementally, in a long series of small
backing-up operations. This makes it straightforward to integrate learning and planning processes simply by 
allowing both to update the same estimated value function. In addition, any of the learning methods can 
be converted into planning methods simply by applying them to simulated (model-generated) experience 
rather than to real experience. In this case learning and planning become even more similar; they are 
possibly identical algorithms operating on two different sources of experience. 

It is straightforward to integrate incremental planning methods with acting and model-learning. 
Planning, acting, and model-learning interact in a circular fashion (Figure 8.1), each producing what 
the other needs to improve; no other interaction among them is either required or prohibited. The 
most natural approach is for all processes to proceed asynchronously and in parallel. If the processes 
must share computational resources, then the division can be handled almost arbitrarily, by whatever 
organization is most convenient and efficient for the task at hand. 
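
A minimal sketch can make this integration concrete: a single update function is applied both to real transitions and to transitions replayed from a learned model, in the spirit of Dyna. The function names, the deterministic table model, the fixed step size, and the treatment of unvisited state-action pairs as having value zero are assumptions made here for illustration.

```python
import random
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """One-step Q-learning update; used unchanged for learning and for planning."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def dyna_step(Q, model, s, a, r, s_next, actions, n_planning=5, rng=random):
    q_update(Q, s, a, r, s_next, actions)        # learning from real experience
    model[(s, a)] = (r, s_next)                  # model-learning (deterministic environment assumed)
    for _ in range(n_planning):                  # planning from simulated experience
        ps, pa = rng.choice(list(model))
        pr, ps_next = model[(ps, pa)]
        q_update(Q, ps, pa, pr, ps_next, actions)

Q, model = defaultdict(float), {}
dyna_step(Q, model, s=0, a="go", r=1.0, s_next=1, actions=["go"],
          n_planning=1, rng=random.Random(0))
```

Here the one real transition is used twice, once directly and once through the model, and both uses update the same table `Q`.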

In this chapter we have touched upon a number of dimensions of variation among state-space planning 
methods. One dimension is the variation in the size of updates. The smaller the updates, the more 
incremental the planning methods can be. Among the smallest updates are one-step sample updates, 
as in Dyna. Another important dimension is the distribution of updates, that is, of the focus of search. 
Prioritized sweeping focuses backward on the predecessors of states whose values have recently changed. 
On-policy trajectory sampling focuses on states or state-action pairs that the agent is likely to encounter 
when controlling its environment. This can allow computation to skip over parts of the state space that 
are irrelevant to the prediction or control problem. Real-time dynamic programming, an on-policy 
trajectory sampling version of value iteration, illustrates some of the advantages this strategy has over 
conventional sweep-based policy iteration. 

Planning can also focus forward from pertinent states, such as states actually encountered during an 
agent-environment interaction. The most important form of this is when planning is done at decision 
time, that is, as part of the action-selection process. Classical heuristic search as studied in artificial 
intelligence is an example of this. Other examples are rollout algorithms and Monte Carlo Tree Search 
that benefit from online, incremental, sample-based value estimation and policy improvement. 


8.13 Summary of Part I: Dimensions 

This chapter concludes Part I of this book. In it we have tried to present reinforcement learning not as 
a collection of individual methods, but as a coherent set of ideas cutting across methods. Each idea can 
be viewed as a dimension along which methods vary. The set of such dimensions spans a large space 
of possible methods. By exploring this space at the level of dimensions we hope to obtain the broadest 
and most lasting understanding. In this section we use the concept of dimensions in method space to 
recapitulate the view of reinforcement learning developed so far in this book. 

All of the methods we have explored so far in this book have three key ideas in common: first, 
they all seek to estimate value functions; second, they all operate by backing up values along actual or 
possible state trajectories; and third, they all follow the general strategy of generalized policy iteration 
(GPI), meaning that they maintain an approximate value function and an approximate policy, and they 
continually try to improve each on the basis of the other. These three ideas are central to the subjects 
covered in this book. We suggest that value functions, backing-up value updates, and GPI are powerful 
organizing principles potentially relevant to any model of intelligence, whether artificial or natural. 

Two of the most important dimensions along which the methods vary are shown in Figure 8.12. These 
dimensions have to do with the kind of update used to improve the value function. The horizontal 
dimension is whether they are sample updates (based on a sample trajectory) or expected updates 
(based on a distribution of possible trajectories). Expected updates require a distribution model, 
whereas sample updates need only a sample model, or can be done from actual experience with no 
model at all (another dimension of variation). The vertical dimension of Figure 8.12 corresponds to the 
depth of updates, that is, to the degree of bootstrapping. At three of the four corners of the space are 
the three primary methods for estimating values: DP, TD, and Monte Carlo. Along the left edge of the 
space are the sample-update methods, ranging from one-step TD updates to full-return Monte Carlo 
updates. Between these is a spectrum including methods based on n-step updates (and in Chapter 12 
we will extend this to mixtures of n-step updates such as the $\lambda$-updates implemented by eligibility 
traces). 

Figure 8.12: A slice through the space of reinforcement learning methods, highlighting two of the most 
important dimensions explored in Part I of this book: the depth and width of the updates. 


DP methods are shown in the extreme upper-right corner of the space because they involve one-step 
expected updates. The lower-right corner is the extreme case of expected updates so deep that they run 
all the way to terminal states (or, in a continuing task, until discounting has reduced the contribution of 
any further rewards to a negligible level). This is the case of exhaustive search. Intermediate methods 
along this dimension include heuristic search and related methods that search and update up to a 
limited depth, perhaps selectively. There are also methods that are intermediate along the horizontal 
dimension. These include methods that mix expected and sample updates, as well as the possibility of 
methods that mix samples and distributions within a single update. The interior of the square is filled 
in to represent the space of all such intermediate methods. 

A third dimension that we have emphasized in this book is the binary distinction between on-policy 
and off-policy methods. In the former case, the agent learns the value function for the policy it is 
currently following, whereas in the latter case it learns the value function for a different 
policy, often the one that the agent currently thinks is best. The policy generating behavior is typically 
different from what is currently thought best because of the need to explore. This third dimension 
might be visualized as perpendicular to the plane of the page in Figure 8.12. 

In addition to the three dimensions just discussed, we have identified a number of others throughout 
the book: 

Definition of return Is the task episodic or continuing, discounted or undiscounted? 

Action values vs. state values vs. afterstate values What kind of values should be estimated? 
If only state values are estimated, then either a model or a separate policy (as in actor-critic 
methods) is required for action selection. 

Action selection/exploration How are actions selected to ensure a suitable trade-off between
exploration and exploitation? We have considered only the simplest ways to do this: $\varepsilon$-greedy, 
optimistic initialization of values, softmax, and upper confidence bound. 

Synchronous vs. asynchronous Are the updates for all states performed simultaneously or one by 
one in some order? 

Real vs. simulated Should one update based on real experience or simulated experience? If both, 
how much of each? 

Location of updates What states or state-action pairs should be updated? Model-free methods 
can choose only among the states and state-action pairs actually encountered, but model-based 
methods can choose arbitrarily. There are many possibilities here. 

Timing of updates Should updates be done as part of selecting actions, or only afterward? 

Memory for updates How long should updated values be retained? Should they be retained
permanently, or only while computing an action selection, as in heuristic search? 

Of course, these dimensions are neither exhaustive nor mutually exclusive. Individual algorithms differ 
in many other ways as well, and many algorithms lie in several places along several dimensions. For 
example, Dyna methods use both real and simulated experience to affect the same value function. It is 
also perfectly sensible to maintain multiple value functions computed in different ways or over different 
state and action representations. These dimensions do, however, constitute a coherent set of ideas for 
describing and exploring a wide space of possible methods. 

The most important dimension not mentioned here, and not covered in Part I of this book, is 
that of function approximation. Function approximation can be viewed as an orthogonal spectrum of 
possibilities ranging from tabular methods at one extreme through state aggregation, a variety of linear 
methods, and then a diverse set of nonlinear methods. This dimension is explored in Part II. 


Bibliographical and Historical Remarks 

8.1 The overall view of planning and learning presented here has developed gradually over a number 
of years, in part by the authors (Sutton, 1990, 1991a, 1991b; Barto, Bradtke, and Singh, 1991, 
1995; Sutton and Pinette, 1985; Sutton and Barto, 1981b); it has been strongly influenced by 
Agre and Chapman (1990; Agre 1988), Bertsekas and Tsitsiklis (1989), Singh (1993), and others. 
The authors were also strongly influenced by psychological studies of latent learning (Tolman, 
1932) and by psychological views of the nature of thought (e.g., Galanter and Gerstenhaber, 
1956; Craik, 1943; Campbell, 1960; Dennett, 1978). In Part III of the book, Section 14.6 
relates model-based and model-free methods to psychological theories of learning and behavior, 
and Section 15.11 discusses ideas about how the brain might implement these types of methods. 

8.2 The terms direct and indirect, which we use to describe different kinds of reinforcement learning, 
are from the adaptive control literature (e.g., Goodwin and Sin, 1984), where they are used to 
make the same kind of distinction. The term system identification is used in adaptive control 
for what we call model-learning (e.g., Goodwin and Sin, 1984; Ljung and Söderström, 1983; 
Young, 1984). The Dyna architecture is due to Sutton (1990), and the results in this and the 
next section are based on results reported there. Barto and Singh (1991) consider some of the 
issues in comparing direct and indirect reinforcement learning methods. 

8.3 There have been several works with model-based reinforcement learning that take the idea of 
exploration bonuses and optimistic initialization to its logical extreme, in which all incompletely 
explored choices are assumed maximally rewarding and optimal paths are computed to test 
them. The E³ algorithm of Kearns and Singh (2002) and the R-max algorithm of Brafman and 
Tennenholtz (2003) are guaranteed to find a near-optimal solution in time polynomial in the 
number of states and actions. This is usually too slow for practical algorithms but is probably 
the best that can be done in the worst case. 

8.4 Prioritized sweeping was developed simultaneously and independently by Moore and Atkeson 
(1993) and Peng and Williams (1993). The results in the box on page 140 are due to Peng 
and Williams (1993). The results in the box on page 141 are due to Moore and Atkeson. Key 
subsequent work in this area includes that by McMahan and Gordon (2005) and by van Seijen 
and Sutton (2013). 

8.5 This section was strongly influenced by the experiments of Singh (1993). 

8.6–7 Trajectory sampling has implicitly been a part of reinforcement learning from the outset, but 
it was most explicitly emphasized by Barto, Bradtke, and Singh (1995) in their introduction 
of RTDP. They recognized that Korf’s (1990) learning real-time A* (LRTA*) algorithm is an 
asynchronous DP algorithm that applies to stochastic problems as well as the deterministic 
problems on which Korf focused. Beyond LRTA*, RTDP includes the option of updating the 
values of many states in the time intervals between the execution of actions. Barto et al. (1995) 
proved the convergence result described here by combining Korf’s (1990) convergence proof 
for LRTA* with the result of Bertsekas (1982) (also Bertsekas and Tsitsiklis, 1989) ensuring 
convergence of asynchronous DP for stochastic shortest path problems in the undiscounted 
case. Combining model-learning with RTDP is called Adaptive RTDP, also presented by Barto 
et al. (1995) and discussed by Barto (2011). 

8.9 For further reading on heuristic search, the reader is encouraged to consult texts and surveys 
such as those by Russell and Norvig (2009) and Korf (1988). Peng and Williams (1993) explored 
a forward focusing of updates much as is suggested in this section. 

8.10 Abramson’s (1990) expected-outcome model is a rollout algorithm applied to two-person games 
in which the play of both simulated players is random. He argued that even with random play, 
it is a “powerful heuristic” that is “precise, accurate, easily estimable, efficiently calculable, and 
domain-independent.” Tesauro and Galperin (1997) demonstrated the effectiveness of rollout 
algorithms for improving the play of backgammon programs, adopting the term “rollout” from 
its use in evaluating backgammon positions by playing out positions with different randomly 
generated sequences of dice rolls. Bertsekas, Tsitsiklis, and Wu (1997) examine rollout
algorithms applied to combinatorial optimization problems, and Bertsekas (2013) surveys their use 
in discrete deterministic optimization problems, remarking that they are “often surprisingly 
effective.” 

8.11 The central ideas of MCTS were introduced by Coulom (2006) and by Kocsis and Szepesvári 
(2006). They built upon previous research with Monte Carlo planning algorithms as reviewed 
by these authors. Browne, Powley, Whitehouse, Lucas, Cowling, Rohlfshagen, Tavener, Perez, 
Samothrakis, and Colton (2012) is an excellent survey of MCTS methods and their applications. 
David Silver contributed to the ideas and presentation in this section. 



Part II: Approximate Solution Methods 


In the second part of the book we extend the tabular methods presented in the first part to apply 
to problems with arbitrarily large state spaces. In many of the tasks to which we would like to apply 
reinforcement learning the state space is combinatorial and enormous; the number of possible camera 
images, for example, is much larger than the number of atoms in the universe. In such cases we cannot 
expect to find an optimal policy or the optimal value function even in the limit of infinite time and 
data; our goal instead is to find a good approximate solution using limited computational resources. In 
this part of the book we explore such approximate solution methods. 

The problem with large state spaces is not just the memory needed for large tables, but the time and 
data needed to fill them accurately. In many of our target tasks, almost every state encountered will 
never have been seen before. To make sensible decisions in such states it is necessary to generalize from 
previous encounters with different states that are in some sense similar to the current one. In other 
words, the key issue is that of generalization. How can experience with a limited subset of the state 
space be usefully generalized to produce a good approximation over a much larger subset? 

Fortunately, generalization from examples has already been extensively studied, and we do not need 
to invent totally new methods for use in reinforcement learning. To some extent we need only combine 
reinforcement learning methods with existing generalization methods. The kind of generalization we 
require is often called function approximation because it takes examples from a desired function (e.g., 
a value function) and attempts to generalize from them to construct an approximation of the entire 
function. Function approximation is an instance of supervised learning, the primary topic studied in 
machine learning, artificial neural networks, pattern recognition, and statistical curve fitting. In theory, 
any of the methods studied in these fields can be used in the role of function approximator within 
reinforcement learning algorithms, although in practice some fit more easily into this role than others. 

Reinforcement learning with function approximation involves a number of new issues that do not 
normally arise in conventional supervised learning, such as nonstationarity, bootstrapping, and delayed 
targets. We introduce these and other issues successively over the five chapters of this part. Initially we 
restrict attention to on-policy training, treating in Chapter 9 the prediction case, in which the policy 
is given and only its value function is approximated, and then in Chapter 10 the control case, in which 
an approximation to the optimal policy is found. The challenging problem of off-policy learning with 
function approximation is treated in Chapter 11. In each of these three chapters we will have to return to 
first principles and re-examine the objectives of the learning to take into account function approximation. 
Chapter 12 introduces and analyzes the algorithmic mechanism of eligibility traces, which dramatically 
improves the computational properties of multi-step reinforcement learning methods in many cases. 
The final chapter of this part explores a different approach to control, policy-gradient methods, which 
approximate the optimal policy directly and need never form an approximate value function (although 
they may be much more efficient if they do approximate a value function as well as the policy). 


Chapter 9 


On-policy Prediction with 
Approximation 


In this chapter, we begin our study of function approximation in reinforcement learning by considering 
its use in estimating the state-value function from on-policy data, that is, in approximating $v_\pi$ from 
experience generated using a known policy $\pi$. The novelty in this chapter is that the approximate 
value function is represented not as a table but as a parameterized functional form with weight vector 
$\mathbf{w} \in \mathbb{R}^d$. We will write $\hat{v}(s,\mathbf{w}) \approx v_\pi(s)$ for the approximate value of state $s$ given weight
vector $\mathbf{w}$. For example, $\hat{v}$ might be a linear function in features of the state, with $\mathbf{w}$ the vector
of feature weights. More generally, $\hat{v}$ might be the function computed by a multi-layer artificial
neural network, with $\mathbf{w}$ the vector of connection weights in all the layers. By adjusting the weights,
any of a wide range of different functions can be implemented by the network. Or $\hat{v}$ might be the
function computed by a decision tree, where $\mathbf{w}$ is all the numbers defining the split points and leaf
values of the tree. Typically, the number of weights (the dimensionality of $\mathbf{w}$) is much less than the
number of states ($d \ll |\mathcal{S}|$), and changing one 
weight changes the estimated value of many states. Consequently, when a single state is updated, the 
change generalizes from that state to affect the values of many other states. Such generalization makes 
the learning potentially more powerful but also potentially more difficult to manage and understand. 
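
The generalization effect can be seen in a small sketch. It assumes a linear approximator over hypothetical state-aggregation features (1000 states in 10 groups) and a simple step that moves the weights in proportion to the error at one state; the feature construction, the step rule, and all names are illustrative assumptions.

```python
N_STATES, N_GROUPS = 1000, 10   # d = 10 weights for 1000 states

def features(s):
    """One-hot indicator of the group that state s belongs to (an assumed feature map)."""
    x = [0.0] * N_GROUPS
    x[s * N_GROUPS // N_STATES] = 1.0
    return x

def v_hat(s, w):
    """Linear approximate value: the inner product of w and the feature vector of s."""
    return sum(wi * xi for wi, xi in zip(w, features(s)))

def nudge(s, target, w, alpha=0.1):
    """Move the estimate at s a fraction alpha of the way toward target."""
    error = target - v_hat(s, w)
    return [wi + alpha * error * xi for wi, xi in zip(w, features(s))]

# Updating state 5 changes one weight, which changes the estimate for every
# state in the same group (0..99) and for no state outside it.
w = nudge(5, 1.0, [0.0] * N_GROUPS)
same_group = v_hat(99, w)
other_group = v_hat(100, w)
```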

Perhaps surprisingly, extending reinforcement learning to function approximation also makes it
applicable to partially observable problems, in which the full state is not available to the agent. If the 
parameterized function form for $\hat{v}$ does not allow the estimated value to depend on certain aspects of 
the state, then it is just as if those aspects are unobservable. In fact, all the theoretical results for 
methods using function approximation presented in this part of the book apply equally well to cases of 
partial observability. What function approximation can’t do, however, is augment the state
representation with memories of past observations. Some such possible further extensions are discussed briefly 
in Section 17.3. 


9.1 Value-function Approximation 

All of the prediction methods covered in this book have been described as updates to an estimated 
value function that shift its value at particular states toward a “backed-up value,” or update target, 
for that state. Let us refer to an individual update by the notation $s \mapsto u$, where $s$ is the state 
updated and $u$ is the update target that $s$’s estimated value is shifted toward. For example, the Monte 
Carlo update for value prediction is $S_t \mapsto G_t$, the TD(0) update is $S_t \mapsto R_{t+1} + \gamma \hat v(S_{t+1},\mathbf{w}_t)$, and 
the $n$-step TD update is $S_t \mapsto G_{t:t+n}$. In the DP (dynamic programming) policy-evaluation update, 
$s \mapsto \mathbb{E}_\pi[R_{t+1} + \gamma \hat v(S_{t+1},\mathbf{w}_t) \mid S_t = s]$, an arbitrary state $s$ is updated, whereas in the other cases the 
state encountered in actual experience, $S_t$, is updated. 

It is natural to interpret each update as specifying an example of the desired input-output behavior 
of the value function. In a sense, the update $s \mapsto u$ means that the estimated value for state $s$ should 
be more like the update target u. Up to now, the actual update has been trivial: the table entry for 
s’s estimated value has simply been shifted a fraction of the way toward u , and the estimated values of 
all other states were left unchanged. Now we permit arbitrarily complex and sophisticated methods to 
implement the update, and updating at s generalizes so that the estimated values of many other states 
are changed as well. Machine learning methods that learn to mimic input-output examples in this 
way are called supervised learning methods, and when the outputs are numbers, like u, the process is 
often called function approximation. Function approximation methods expect to receive examples of the 
desired input-output behavior of the function they are trying to approximate. We use these methods 
for value prediction simply by passing to them the $s \mapsto u$ of each update as a training example. We 
then interpret the approximate function they produce as an estimated value function. 
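
To make the idea of an update as a training example concrete, here is a minimal sketch that turns one episode into input-output pairs $s \mapsto u$ for two of the targets named above. The function names, the list-based episode layout, and the toy numbers are assumptions made for illustration; they are not notation from the text.

```python
def mc_examples(states, rewards, gamma=1.0):
    """Monte Carlo pairs (S_t, G_t).

    states = [S_0, ..., S_{T-1}], rewards = [R_1, ..., R_T].
    """
    examples = []
    g = 0.0
    # Sweep backwards so that each return reuses the one after it.
    for s, r in zip(reversed(states), reversed(rewards)):
        g = r + gamma * g
        examples.append((s, g))
    examples.reverse()
    return examples


def td0_examples(states, rewards, v_hat, gamma=1.0):
    """TD(0) pairs (S_t, R_{t+1} + gamma * v_hat(S_{t+1})).

    The state after the last listed state is terminal, with value 0.
    """
    examples = []
    for t, (s, r) in enumerate(zip(states, rewards)):
        next_value = v_hat(states[t + 1]) if t + 1 < len(states) else 0.0
        examples.append((s, r + gamma * next_value))
    return examples


# Toy episode (hypothetical): three states, reward 1 on the final transition.
states = ["a", "b", "c"]
rewards = [0.0, 0.0, 1.0]
current_estimate = {"a": 0.1, "b": 0.4, "c": 0.7}

mc = mc_examples(states, rewards)
td = td0_examples(states, rewards, current_estimate.get)
```

Either list could then be handed to any supervised learner that accepts examples one at a time. Note that the TD(0) targets depend on the current estimate, which is the source of the nonstationarity discussed in this section.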

Viewing each update as a conventional training example in this way enables us to use any of a 
wide range of existing function approximation methods for value prediction. In principle, we can 
use any method for supervised learning from examples, including artificial neural networks, decision 
trees, and various kinds of multivariate regression. However, not all function approximation methods 
are equally well suited for use in reinforcement learning. The most sophisticated neural network and 
statistical methods all assume a static training set over which multiple passes are made. In reinforcement 
learning, however, it is important that learning be able to occur on-line, while the agent interacts with 
its environment or with a model of its environment. To do this requires methods that are able to learn 
efficiently from incrementally acquired data. In addition, reinforcement learning generally requires 
function approximation methods able to handle nonstationary target functions (target functions that 
change over time). For example, in control methods based on GPI (generalized policy iteration) we often 
seek to learn $q_\pi$ while $\pi$ changes. Even if the policy remains the same, the target values of training 
examples are nonstationary if they are generated by bootstrapping methods (DP and TD learning). 
Methods that cannot easily handle such nonstationarity are less suitable for reinforcement learning. 


9.2 The Prediction Objective (VE) 


Up to now we have not specified an explicit objective for prediction. In the tabular case a continuous 
measure of prediction quality was not necessary because the learned value function could come to equal 
the true value function exactly. Moreover, the learned values at each state were decoupled: an update 
at one state affected no other. But with genuine approximation, an update at one state affects many 
others, and it is not possible to get the values of all states exactly correct. By assumption we have 
far more states than weights, so making one state’s estimate more accurate invariably means making 
others’ less accurate. We are obligated then to say which states we care most about. We must specify 
a state weighting or distribution $\mu(s) \ge 0$, $\sum_s \mu(s) = 1$, representing how much we care about the error 
in each state $s$. By the error in a state $s$ we mean the square of the difference between the approximate 
value $\hat v(s,\mathbf{w})$ and the true value $v_\pi(s)$. Weighting this over the state space by $\mu$, we obtain a natural 
objective function, the Mean Squared Value Error, denoted VE: 


$$\mathrm{VE}(\mathbf{w}) \doteq \sum_{s \in \mathcal{S}} \mu(s) \Big[ v_\pi(s) - \hat v(s,\mathbf{w}) \Big]^2. \qquad (9.1)$$


The square root of this measure, the root VE, gives a rough measure of how much the approximate 
values differ from the true values and is often used in plots. Often $\mu(s)$ is chosen to be the fraction of 
time spent in $s$. Under on-policy training this is called the on-policy distribution; we focus entirely on 
this case in this chapter. In continuing tasks, the on-policy distribution is the stationary distribution 
under $\pi$. 
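
As a small numerical illustration of (9.1), the sketch below evaluates the VE and its square root for a hypothetical three-state problem in which a single weight is shared by all states. The distribution, true values, and function name are invented for the example.

```python
import math


def value_error(mu, v_true, v_hat):
    """Mean squared value error: sum over states of mu(s) * (v_true(s) - v_hat(s))^2."""
    return sum(m * (vt - vh) ** 2 for m, vt, vh in zip(mu, v_true, v_hat))


# Hypothetical numbers: state weighting and true values for three states.
mu = [0.5, 0.3, 0.2]
v_true = [1.0, 2.0, 3.0]

# An approximator with one weight shared by all states can only output a constant.
ve = value_error(mu, v_true, [1.5, 1.5, 1.5])
root_ve = math.sqrt(ve)

# With a single shared weight, the VE-minimizing constant is the mu-weighted mean.
weighted_mean = sum(m * v for m, v in zip(mu, v_true))
ve_best = value_error(mu, v_true, [weighted_mean] * 3)
```

Even the best constant leaves a nonzero error here, and which constant is best depends on $\mu$: changing the weighting moves the minimizer toward the states that are weighted more heavily.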


The on-policy distribution in episodic tasks 


In an episodic task, the on-policy distribution is a little different in that it depends on how the 
initial states of episodes are chosen. Let $h(s)$ denote the probability that an episode begins in 
each state $s$, and let $\eta(s)$ denote the number of time steps spent, on average, in state $s$ in a 
single episode. Time is spent in a state $s$ if episodes start in $s$, or if transitions are made into $s$ 
from a preceding state $\bar s$ in which time is spent: 

$$\eta(s) = h(s) + \sum_{\bar s} \eta(\bar s) \sum_a \pi(a \mid \bar s)\, p(s \mid \bar s, a), \quad \text{for all } s \in \mathcal{S}. \qquad (9.2)$$

This system of equations can be solved for the expected number of visits $\eta(s)$. The on-policy 
distribution is then the fraction of time spent in each state normalized to sum to one: 

$$\mu(s) = \frac{\eta(s)}{\sum_{s'} \eta(s')}, \quad \text{for all } s \in \mathcal{S}. \qquad (9.3)$$

This is the natural choice without discounting. If there is discounting ($\gamma < 1$) it should be treated 
as a form of termination, which can be done simply by including a factor of $\gamma$ in the second 
term of (9.2). Although this is more general, it would complicate the following presentation of 
algorithms and concerns a rare case that we don’t treat in this chapter, so we omit it here. 
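
The system (9.2) can be solved numerically in several ways. The sketch below uses simple repeated substitution, which is an assumption of this example rather than a method prescribed by the text; it converges when every episode eventually terminates. The argument `p_pi[sb][s]` stands for the policy-averaged transition probability $\sum_a \pi(a \mid \bar s)\,p(s \mid \bar s, a)$ between nonterminal states, and the two-state task is hypothetical.

```python
def expected_visits(h, p_pi, sweeps=1000):
    """Iterate eta(s) = h(s) + sum over sb of eta(sb) * p_pi[sb][s], as in (9.2)."""
    n = len(h)
    eta = list(h)
    for _ in range(sweeps):
        eta = [h[s] + sum(eta[sb] * p_pi[sb][s] for sb in range(n)) for s in range(n)]
    return eta


def on_policy_distribution(h, p_pi):
    """Normalize the expected visit counts to sum to one, as in (9.3)."""
    eta = expected_visits(h, p_pi)
    total = sum(eta)
    return [e / total for e in eta]


# Hypothetical episodic task with two nonterminal states. Episodes always start
# in state 0; from either state the process moves to the other state with
# probability 0.5 and terminates otherwise (rows therefore sum to less than 1).
h = [1.0, 0.0]
p_pi = [[0.0, 0.5],
        [0.5, 0.0]]

eta = expected_visits(h, p_pi)         # approaches [4/3, 2/3]
mu = on_policy_distribution(h, p_pi)   # approaches [2/3, 1/3]
```

The values in the comments follow from solving $\eta(0) = 1 + 0.5\,\eta(1)$ and $\eta(1) = 0.5\,\eta(0)$ by hand.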


The two cases, continuing and episodic, behave similarly, but with approximation they must be 
treated separately in formal analyses, as we will see repeatedly in this part of the book. This completes 
the specification of the learning objective. 

But it is not completely clear that the VE is the right performance objective for reinforcement 
learning. Remember that our ultimate purpose—the reason we are learning a value function—is to find 
a better policy. The best value function for this purpose is not necessarily the best for minimizing VE. 
Nevertheless, it is not yet clear what a more useful alternative goal for value prediction might be. For 
now, we will focus on VE. 

An ideal goal in terms of VE would be to find a global optimum, a weight vector $\mathbf{w}^*$ for which 
$\mathrm{VE}(\mathbf{w}^*) \le \mathrm{VE}(\mathbf{w})$ for all possible $\mathbf{w}$. Reaching this goal is sometimes possible for simple function 
approximators such as linear ones, but is rarely possible for complex function approximators such as 
artificial neural networks and decision trees. Short of this, complex function approximators may seek 
to converge instead to a local optimum, a weight vector $\mathbf{w}^*$ for which $\mathrm{VE}(\mathbf{w}^*) \le \mathrm{VE}(\mathbf{w})$ for all $\mathbf{w}$ in 
some neighborhood of $\mathbf{w}^*$. Although this guarantee is only slightly reassuring, it is typically the best 
that can be said for nonlinear function approximators, and often it is enough. Still, for many cases of 
interest in reinforcement learning there is no guarantee of convergence to an optimum, or even to within 
a bounded distance of an optimum. Some methods may in fact diverge, with their VE approaching 
infinity in the limit. 

In the last two sections we outlined a framework for combining a wide range of reinforcement learning 
methods for value prediction with a wide range of function approximation methods, using the updates 
of the former to generate training examples for the latter. We also described a VE performance measure 
which these methods may aspire to minimize. The range of possible function approximation methods is 
far too large to cover all, and anyway too little is known about most of them to make a reliable evaluation 
or recommendation. Of necessity, we consider only a few possibilities. In the rest of this chapter we 
focus on function approximation methods based on gradient principles, and on linear gradient-descent 
methods in particular. We focus on these methods in part because we consider them to be particularly 
promising and because they reveal key theoretical issues, but also because they are simple and our space 
is limited. 


9.3 Stochastic-gradient and Semi-gradient Methods 

We now develop in detail one class of learning methods for function approximation in value prediction, 
those based on stochastic gradient descent (SGD). SGD methods are among the most widely used of 
all function approximation methods and are particularly well suited to online reinforcement learning. 

In gradient-descent methods, the weight vector is a column vector with a fixed number of real-valued 
components, $\mathbf{w} \doteq (w_1, w_2, \ldots, w_d)^\top$, and the approximate value function $\hat v(s,\mathbf{w})$ is a differentiable 
function of $\mathbf{w}$ for all $s \in \mathcal{S}$. We will be updating $\mathbf{w}$ at each of a series of discrete time steps, $t = 
0, 1, 2, 3, \ldots$, so we will need a notation $\mathbf{w}_t$ for the weight vector at each step. For now, let us assume 
that, on each step, we observe a new example $S_t \mapsto v_\pi(S_t)$ consisting of a (possibly randomly selected) 
state $S_t$ and its true value under the policy. These states might be successive states from an interaction 
with the environment, but for now we do not assume so. Even though we are given the exact, correct 
values, $v_\pi(S_t)$ for each $S_t$, there is still a difficult problem because our function approximator has limited 
resources and thus limited resolution. In particular, there is generally no $\mathbf{w}$ that gets all the states, or 
even all the examples, exactly correct. In addition, we must generalize to all the other states that have 
not appeared in examples. 

We assume that states appear in examples with the same distribution, $\mu$, over which we are trying 
to minimize the VE as given by (9.1). A good strategy in this case is to try to minimize error on 
the observed examples. Stochastic gradient-descent (SGD) methods do this by adjusting the weight 
vector after each example by a small amount in the direction that would most reduce the error on that 
example: 


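$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t - \tfrac{1}{2}\alpha \nabla \Big[ v_\pi(S_t) - \hat v(S_t,\mathbf{w}_t) \Big]^2 \qquad (9.4)$$

$$\phantom{\mathbf{w}_{t+1}} = \mathbf{w}_t + \alpha \Big[ v_\pi(S_t) - \hat v(S_t,\mathbf{w}_t) \Big] \nabla \hat v(S_t,\mathbf{w}_t), \qquad (9.5)$$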


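where $\alpha$ is a positive step-size parameter, and $\nabla f(\mathbf{w})$, for any scalar expression $f(\mathbf{w})$, denotes the 
vector of partial derivatives with respect to the components of the weight vector: 

$$\nabla f(\mathbf{w}) \doteq \left( \frac{\partial f(\mathbf{w})}{\partial w_1}, \frac{\partial f(\mathbf{w})}{\partial w_2}, \ldots, \frac{\partial f(\mathbf{w})}{\partial w_d} \right)^\top. \qquad (9.6)$$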


This derivative vector is the gradient of $f$ with respect to $\mathbf{w}$. SGD methods are “gradient descent” 
methods because the overall step in $\mathbf{w}_t$ is proportional to the negative gradient of the example’s squared 
error (9.4). This is the direction in which the error falls most rapidly. Gradient descent methods are 
called “stochastic” when the update is done, as here, on only a single example, which might have been 
selected stochastically. Over many examples, making small steps, the overall effect is to minimize an 
average performance measure such as the VE. 
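
The step from (9.4) to (9.5) is an application of the chain rule. Written out under the assumption stated above, that the target $v_\pi(S_t)$ does not depend on $\mathbf{w}_t$, it reads:

```latex
\begin{align*}
\nabla \Big[ v_\pi(S_t) - \hat v(S_t,\mathbf{w}_t) \Big]^2
  &= 2 \Big[ v_\pi(S_t) - \hat v(S_t,\mathbf{w}_t) \Big]
     \nabla \Big[ v_\pi(S_t) - \hat v(S_t,\mathbf{w}_t) \Big] \\
  &= -2 \Big[ v_\pi(S_t) - \hat v(S_t,\mathbf{w}_t) \Big]
     \nabla \hat v(S_t,\mathbf{w}_t),
\end{align*}
% since \nabla v_\pi(S_t) = 0 when the target is independent of the weights.
% Multiplying by -\alpha/2 gives the increment in (9.5).
```

The factor of $\tfrac{1}{2}$ in (9.4) is there only to cancel the 2 produced by differentiating the square.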

It may not be immediately apparent why SGD takes only a small step in the direction of the gradient. 
Could we not move all the way in this direction and completely eliminate the error on the example? In 
many cases this could be done, but usually it is not desirable. Remember that we do not seek or expect 
to find a value function that has zero error for all states, but only an approximation that balances the 
errors in different states. If we completely corrected each example in one step, then we would not find 
such a balance. In fact, the convergence results for SGD methods assume that $\alpha$ decreases over time. 
If it decreases in such a way as to satisfy the standard stochastic approximation conditions (2.7), then 
the SGD method (9.5) is guaranteed to converge to a local optimum. 

1 The superscript T ($^\top$) denotes transpose, needed here to turn the horizontal row vector in the text into a vertical column vector; in 
this book vectors are generally taken to be column vectors unless explicitly written out horizontally or transposed. 

We turn now to the case in which the target output, here denoted $U_t \in \mathbb{R}$, of the $t$th training example, 
$S_t \mapsto U_t$, is not the true value, $v_\pi(S_t)$, but some, possibly random, approximation to it. For example, 
$U_t$ might be a noise-corrupted version of $v_\pi(S_t)$, or it might be one of the bootstrapping targets using 
$\hat v$ mentioned in the previous section. In these cases we cannot perform the exact update (9.5) because 
$v_\pi(S_t)$ is unknown, but we can approximate it by substituting $U_t$ in place of $v_\pi(S_t)$. This yields the 
following general SGD method for state-value prediction: 


$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \Big[ U_t - \hat v(S_t,\mathbf{w}_t) \Big] \nabla \hat v(S_t,\mathbf{w}_t). \qquad (9.7)$$


If $U_t$ is an unbiased estimate, that is, if $\mathbb{E}[U_t \mid S_t = s] = v_\pi(S_t)$, for each $t$, then $\mathbf{w}_t$ is guaranteed to 
converge to a local optimum under the usual stochastic approximation conditions (2.7) for decreasing 
$\alpha$. 

For example, suppose the states in the examples are the states generated by interaction (or simulated 
interaction) with the environment using policy $\pi$. Because the true value of a state is the expected value 
of the return following it, the Monte Carlo target $U_t \doteq G_t$ is by definition an unbiased estimate of $v_\pi(S_t)$. 
With this choice, the general SGD method (9.7) converges to a locally optimal approximation to $v_\pi(S_t)$. 
Thus, the gradient-descent version of Monte Carlo state-value prediction is guaranteed to find a locally 
optimal solution. Pseudocode for a complete algorithm is shown in the box below. 


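Gradient Monte Carlo Algorithm for Estimating $\hat v \approx v_\pi$ 

Input: the policy $\pi$ to be evaluated 
Input: a differentiable function $\hat v : \mathcal{S} \times \mathbb{R}^d \to \mathbb{R}$ 

Initialize value-function weights $\mathbf{w}$ as appropriate (e.g., $\mathbf{w} = \mathbf{0}$) 
Repeat forever: 
    Generate an episode $S_0, A_0, R_1, S_1, A_1, \ldots, R_T, S_T$ using $\pi$ 
    For $t = 0, 1, \ldots, T-1$: 
        $\mathbf{w} \leftarrow \mathbf{w} + \alpha \big[ G_t - \hat v(S_t,\mathbf{w}) \big] \nabla \hat v(S_t,\mathbf{w})$ 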
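
The boxed algorithm can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the book's reference code: the function signature, the 5-state random-walk demo task, the two linear features, and the constant step size are all choices made for the example.

```python
import random


def gradient_mc(generate_episode, v_hat, grad_v_hat, w, alpha, num_episodes, gamma=1.0):
    """Gradient Monte Carlo: w <- w + alpha * [G_t - v_hat(S_t, w)] * grad v_hat(S_t, w)."""
    for _ in range(num_episodes):
        states, rewards = generate_episode()  # S_0..S_{T-1} and R_1..R_T
        # Compute the returns G_0..G_{T-1} by sweeping backwards through the rewards.
        g, returns = 0.0, []
        for r in reversed(rewards):
            g = r + gamma * g
            returns.append(g)
        returns.reverse()
        for s, g_t in zip(states, returns):
            error = g_t - v_hat(s, w)
            w = [wi + alpha * error * gi for wi, gi in zip(w, grad_v_hat(s, w))]
    return w


# Hypothetical demo task: 5-state random walk starting in state 3, with reward
# -1 for leaving on the left, +1 for leaving on the right, and 0 otherwise.
rng = random.Random(0)


def generate_episode():
    s, states, rewards = 3, [], []
    while True:
        states.append(s)
        s += rng.choice((-1, 1))
        if s < 1 or s > 5:
            rewards.append(-1.0 if s < 1 else 1.0)
            return states, rewards
        rewards.append(0.0)


def features(s):
    return [1.0, (s - 3) / 2.0]  # bias term plus centered, scaled position


def v_hat(s, w):
    return sum(wi * xi for wi, xi in zip(w, features(s)))


def grad_v_hat(s, w):
    return features(s)  # the gradient of a linear function is its feature vector


w = gradient_mc(generate_episode, v_hat, grad_v_hat, [0.0, 0.0],
                alpha=0.002, num_episodes=20000)
```

For this particular task the true values are $(s-3)/3$, which the two features can represent exactly with weights $(0, 2/3)$, so the learned weights should end up fluctuating near those values. With a constant step size they will not settle exactly.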


One does not obtain the same guarantees if a bootstrapping estimate of $v_\pi(S_t)$ is used as the target $U_t$ 
in (9.7). Bootstrapping targets such as $n$-step returns $G_{t:t+n}$ or the DP target $\sum_{a,s',r} \pi(a \mid S_t)\, p(s',r \mid S_t,a)[r + 
\gamma \hat v(s',\mathbf{w}_t)]$ all depend on the current value of the weight vector $\mathbf{w}_t$, which implies that they will be 
biased and that they will not produce a true gradient-descent method. One way to look at this is that 
the key step from (9.4) to (9.5) relies on the target being independent of $\mathbf{w}_t$. This step would not be 
valid if a bootstrapping estimate were used in place of $v_\pi(S_t)$. Bootstrapping methods are not in fact 
instances of true gradient descent (Barnard, 1993). They take into account the effect of changing the 
weight vector $\mathbf{w}_t$ on the estimate, but ignore its effect on the target. They include only a part of the 
gradient and, accordingly, we call them semi-gradient methods. 

Although semi-gradient (bootstrapping) methods do not converge as robustly as gradient methods, 
they do converge reliably in important cases such as the linear case discussed in the next section. 
Moreover, they offer important advantages that make them often clearly preferred. One reason for 
this is that they typically enable significantly faster learning, as we have seen in Chapters 6 and 7. 
Another is that they enable learning to be continual and online, without waiting for the end of an 
episode. This enables them to be used on continuing problems and provides computational advantages. 

A prototypical semi-gradient method is semi-gradient TD(0), which uses $U_t \doteq R_{t+1} + \gamma \hat v(S_{t+1},\mathbf{w})$ as 
its target. Complete pseudocode for this method is given in the box below. 
Semi-gradient TD(0) for estimating $\hat v \approx v_\pi$ 

Input: the policy $\pi$ to be evaluated 
Input: a differentiable function $\hat v : \mathcal{S}^+ \times \mathbb{R}^d \to \mathbb{R}$ such that $\hat v(\text{terminal}, \cdot) = 0$ 

Initialize value-function weights $\mathbf{w}$ arbitrarily (e.g., $\mathbf{w} = \mathbf{0}$) 
Repeat (for each episode): 
    Initialize $S$ 
    Repeat (for each step of episode): 
        Choose $A \sim \pi(\cdot \mid S)$ 
        Take action $A$, observe $R$, $S'$ 
        $\mathbf{w} \leftarrow \mathbf{w} + \alpha \big[ R + \gamma \hat v(S',\mathbf{w}) - \hat v(S,\mathbf{w}) \big] \nabla \hat v(S,\mathbf{w})$ 
        $S \leftarrow S'$ 
    until $S'$ is terminal 
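
A Python sketch of the boxed procedure, restricted to linear approximators so that the gradient is just the feature vector. The interface (`reset`, `step`, `policy`, `features`), the use of `None` to mark the terminal state, and the deterministic three-state chain are assumptions made for this illustration.

```python
def semi_gradient_td0(reset, step, policy, features, d, alpha, gamma, num_episodes):
    """Semi-gradient TD(0) with a linear approximator v_hat(s, w) = w . x(s)."""
    w = [0.0] * d

    def v_hat(s):
        # v_hat(terminal, .) = 0; the terminal state is represented by None here.
        if s is None:
            return 0.0
        return sum(wi * xi for wi, xi in zip(w, features(s)))

    for _ in range(num_episodes):
        s = reset()
        while s is not None:
            a = policy(s)
            s_next, r = step(s, a)
            # The target R + gamma * v_hat(S', w) is treated as a constant:
            # only the gradient of v_hat(S, w) appears in the update.
            delta = r + gamma * v_hat(s_next) - v_hat(s)
            x = features(s)
            for i in range(d):
                w[i] += alpha * delta * x[i]
            s = s_next
    return w


# Hypothetical deterministic chain 0 -> 1 -> 2 -> terminal, reward 1 on the last step.
def reset():
    return 0


def policy(s):
    return 0  # a single dummy action


def step(s, a):
    return (None, 1.0) if s == 2 else (s + 1, 0.0)


def one_hot(s):
    return [1.0 if i == s else 0.0 for i in range(3)]


w = semi_gradient_td0(reset, step, policy, one_hot, d=3,
                      alpha=0.1, gamma=0.9, num_episodes=2000)
```

With one-hot features the method reduces to tabular TD(0), so on this chain the weights approach the true values $0.81$, $0.9$, and $1$. Replacing `one_hot` with a coarser feature function gives genuine function approximation.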


State aggregation is a simple form of generalizing function approximation in which states are grouped 
together, with one estimated value (one component of the weight vector w) for each group. The value 
of a state is estimated as its group’s component, and when the state is updated, that component alone 
is updated. State aggregation is a special case of SGD (9.7) in which the gradient, $\nabla \hat v(S_t,\mathbf{w}_t)$, is 1 for 
$S_t$’s group’s component and 0 for the other components. 

Example 9.1: State Aggregation on the 1000-state Random Walk Consider a 1000-state 
version of the random walk task (Examples 6.2 and 7.1 on pages 102 and 118). The states are numbered 
from 1 to 1000, left to right, and all episodes begin near the center, in state 500. State transitions are 
from the current state to one of the 100 neighboring states to its left, or to one of the 100 neighboring 
states to its right, all with equal probability. Of course, if the current state is near an edge, then there 
may be fewer than 100 neighbors on that side of it. In this case, all the probability that would have 
gone into those missing neighbors goes into the probability of terminating on that side (thus, state 1 has 
a 0.5 chance of terminating on the left, and state 950 has a 0.25 chance of terminating on the right). As 
usual, termination on the left produces a reward of $-1$, and termination on the right produces a reward 
of +1. All other transitions have a reward of zero. We use this task as a running example throughout 
this section. 

Figure 9.1 shows the true value function $v_\pi$ for this task. It is nearly a straight line, but tilted 
slightly toward the horizontal and curving further in this direction for the last 100 states at each end. 
Also shown is the final approximate value function learned by the gradient Monte Carlo algorithm with 
state aggregation after 100,000 episodes with a step size of $\alpha = 2 \times 10^{-5}$. For the state aggregation, 
the 1000 states were partitioned into 10 groups of 100 states each (i.e., states 1-100 were one group, 
states 101-200 were another, and so on). The staircase effect shown in the figure is typical of state 
aggregation; within each group, the approximate value is constant, and it changes abruptly from one 
group to the next. These approximate values are close to the global minimum of the VE (9.1). 

Some of the details of the approximate values are best appreciated by reference to the state 
distribution $\mu$ for this task, shown in the lower portion of the figure with a right-side scale. State 500, in 
the center, is the first state of every episode, but is rarely visited again. On average, about 1.37% of 
the time steps are spent in the start state. The states reachable in one step from the start state are 
the second most visited, with about 0.17% of the time steps being spent in each of them. From there 
$\mu$ falls off almost linearly, reaching about 0.0147% at the extreme states 1 and 1000. The most visible 
effect of the distribution is on the leftmost groups, whose values are clearly shifted higher than the 
unweighted average of the true values of states within the group, and on the rightmost groups, whose 
values are clearly shifted lower. This is due to the states in these areas having the greatest asymmetry 
in their weightings by $\mu$. For example, in the leftmost group, state 100 is weighted more than 3 times 
more strongly than state 1. Thus the estimate for the group is biased toward the true value of state 100, 
which is higher than the true value of state 1. 

Figure 9.1: Function approximation by state aggregation on the 1000-state random walk task, using the gradient 
Monte Carlo algorithm (page 165). 
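
The experiment in Example 9.1 can be approximated with a short script. The sketch below follows the task description (1000 states, jumps of up to 100 states, 10 groups of 100), but as an assumption for quick execution it uses far fewer episodes and a larger step size than the 100,000 episodes and $\alpha = 2 \times 10^{-5}$ quoted in the text, so its output is only a rough version of the staircase in Figure 9.1. All identifiers are illustrative.

```python
import random

NUM_STATES, NUM_GROUPS, MAX_JUMP = 1000, 10, 100
GROUP_SIZE = NUM_STATES // NUM_GROUPS


def group_of(state):
    """Index of the aggregation group containing a state numbered 1..1000."""
    return (state - 1) // GROUP_SIZE


def run_episode(rng):
    """Simulate one episode; return the visited states and the terminal reward."""
    state, visited = 500, []
    while True:
        visited.append(state)
        # Each of the 200 jumps (1..100 states left or right) is equally likely;
        # a jump past either end terminates the episode on that side.
        state += rng.randint(1, MAX_JUMP) * rng.choice((-1, 1))
        if state < 1:
            return visited, -1.0
        if state > NUM_STATES:
            return visited, 1.0


def gradient_mc_aggregation(num_episodes, alpha, seed=0):
    rng = random.Random(seed)
    w = [0.0] * NUM_GROUPS
    for _ in range(num_episodes):
        visited, g = run_episode(rng)
        # The task is undiscounted and only the final reward is nonzero,
        # so the return G_t equals the terminal reward for every t.
        for s in visited:
            k = group_of(s)
            # The gradient is 1 for the state's group and 0 elsewhere,
            # so only one component of w changes.
            w[k] += alpha * (g - w[k])
    return w


w = gradient_mc_aggregation(num_episodes=5000, alpha=2e-4)
```

Each entry of `w` is the constant value assigned to one group of 100 states. Running longer with a smaller step size should bring the weights closer to the values described in the example.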


9.4 Linear Methods 


One of the most important special cases of function approximation is that in which the approximate 
function, $\hat v(\cdot,\mathbf{w})$, is a linear function of the weight vector, $\mathbf{w}$. Corresponding to every state $s$, there is a 
real-valued vector $\mathbf{x}(s) \doteq (x_1(s), x_2(s), \ldots, x_d(s))^\top$, with the same number of components as $\mathbf{w}$. Linear 
methods approximate the state-value function by the inner product between $\mathbf{w}$ and $\mathbf{x}(s)$: 


$$\hat v(s,\mathbf{w}) \doteq \mathbf{w}^\top \mathbf{x}(s) \doteq \sum_{i=1}^{d} w_i x_i(s). \qquad (9.8)$$

In this case the approximate value function is said to be linear in the weights , or simply linear. 

The vector $\mathbf{x}(s)$ is called a feature vector representing state $s$. Each component $x_i(s)$ of $\mathbf{x}(s)$ is the 
value of a function $x_i : \mathcal{S} \to \mathbb{R}$. We think of a feature as the entirety of one of these functions, and we 
call its value for a state s a feature of s. For linear methods, features are basis functions because they 
form a linear basis for the set of approximate functions. Constructing d-dimensional feature vectors to 
represent states is the same as selecting a set of d basis functions. Features may be defined in many 
different ways; we cover a few possibilities in the next sections. 

It is natural to use SGD updates with linear function approximation. The gradient of the approximate 
value function with respect to w in this case is 


$$\nabla \hat v(s,\mathbf{w}) = \mathbf{x}(s).$$

Thus, in the linear case the general SGD update (9.7) reduces to a particularly simple form: 


$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \Big[ U_t - \hat v(S_t,\mathbf{w}_t) \Big] \mathbf{x}(S_t).$$


Because it is so simple, the linear SGD case is one of the most favorable for mathematical analysis. 
Almost all useful convergence results for learning systems of all kinds are for linear (or simpler) function 
approximation methods. 
In particular, in the linear case there is only one optimum (or, in degenerate cases, one set of equally 
good optima), and thus any method that is guaranteed to converge to or near a local optimum is 
automatically guaranteed to converge to or near the global optimum. For example, the gradient Monte 
Carlo algorithm presented in the previous section converges to the global optimum of the VE under 
linear function approximation if $\alpha$ is reduced over time according to the usual conditions. 

The semi-gradient TD(0) algorithm presented in the previous section also converges under linear 
function approximation, but this does not follow from general results on SGD; a separate theorem is 
necessary. The weight vector converged to is also not the global optimum, but rather a point near the 
local optimum. It is useful to consider this important case in more detail, specifically for the continuing 
case. The update at each time t is 


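$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \Big( R_{t+1} + \gamma \mathbf{w}_t^\top \mathbf{x}_{t+1} - \mathbf{w}_t^\top \mathbf{x}_t \Big) \mathbf{x}_t \qquad (9.9)$$

$$\phantom{\mathbf{w}_{t+1}} = \mathbf{w}_t + \alpha \Big( R_{t+1}\mathbf{x}_t - \mathbf{x}_t (\mathbf{x}_t - \gamma \mathbf{x}_{t+1})^\top \mathbf{w}_t \Big),$$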


where here we have used the notational shorthand $\mathbf{x}_t = \mathbf{x}(S_t)$. Once the system has reached steady 
state, for any given $\mathbf{w}_t$, the expected next weight vector can be written 


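$$\mathbb{E}[\mathbf{w}_{t+1} \mid \mathbf{w}_t] = \mathbf{w}_t + \alpha (\mathbf{b} - \mathbf{A}\mathbf{w}_t), \qquad (9.10)$$

where 

$$\mathbf{b} \doteq \mathbb{E}[R_{t+1}\mathbf{x}_t] \in \mathbb{R}^d \quad \text{and} \quad \mathbf{A} \doteq \mathbb{E}\Big[ \mathbf{x}_t (\mathbf{x}_t - \gamma \mathbf{x}_{t+1})^\top \Big] \in \mathbb{R}^{d \times d}. \qquad (9.11)$$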


From (9.10) it is clear that, if the system converges, it must converge to the weight vector $\mathbf{w}_{TD}$ at which 

$$\mathbf{b} - \mathbf{A}\mathbf{w}_{TD} = \mathbf{0} \quad\Rightarrow\quad \mathbf{b} = \mathbf{A}\mathbf{w}_{TD} \quad\Rightarrow\quad \mathbf{w}_{TD} \doteq \mathbf{A}^{-1}\mathbf{b}. \qquad (9.12)$$

This quantity is called the TD fixed point. In fact linear semi-gradient TD(0) converges to this point. 
Some of the theory proving its convergence, and the existence of the inverse above, is given in the box. 


Proof of Convergence of Linear TD(0) 


What properties assure convergence of the linear TD(0) algorithm (9.9)? Some insight can be 
gained by rewriting (9.10) as 

$$\mathbb{E}[\mathbf{w}_{t+1} \mid \mathbf{w}_t] = (\mathbf{I} - \alpha\mathbf{A})\mathbf{w}_t + \alpha\mathbf{b}. \qquad (9.13)$$

Note that the matrix $\mathbf{A}$ multiplies the weight vector $\mathbf{w}_t$ and not $\mathbf{b}$; only $\mathbf{A}$ is important to 
convergence. To develop intuition, consider the special case in which $\mathbf{A}$ is a diagonal matrix. If 
any of the diagonal elements are negative, then the corresponding diagonal element of $\mathbf{I} - \alpha\mathbf{A}$ 
will be greater than one, and the corresponding component of $\mathbf{w}_t$ will be amplified, which will 
lead to divergence if continued. On the other hand, if the diagonal elements of $\mathbf{A}$ are all positive, 
then $\alpha$ can be chosen smaller than one over the largest of them, such that $\mathbf{I} - \alpha\mathbf{A}$ is diagonal 
with all diagonal elements between 0 and 1. In this case the first term of the update tends to 
shrink $\mathbf{w}_t$, and stability is assured. In the general case, $\mathbf{w}_t$ will be reduced toward zero whenever $\mathbf{A}$ 
is positive definite, meaning $\mathbf{y}^\top \mathbf{A} \mathbf{y} > 0$ for any real vector $\mathbf{y} \neq \mathbf{0}$. Positive definiteness also ensures that 
the inverse $\mathbf{A}^{-1}$ exists. 
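For linear TD(0), in the continuing case with $\gamma < 1$, the $\mathbf{A}$ matrix (9.11) can be written 

$$\mathbf{A} = \sum_s \mu(s) \sum_a \pi(a \mid s) \sum_{r,s'} p(r,s' \mid s,a)\, \mathbf{x}(s) \big( \mathbf{x}(s) - \gamma \mathbf{x}(s') \big)^\top$$

$$= \sum_s \mu(s) \sum_{s'} p(s' \mid s)\, \mathbf{x}(s) \big( \mathbf{x}(s) - \gamma \mathbf{x}(s') \big)^\top$$

$$= \sum_s \mu(s)\, \mathbf{x}(s) \Big( \mathbf{x}(s) - \gamma \sum_{s'} p(s' \mid s)\, \mathbf{x}(s') \Big)^\top$$

$$= \mathbf{X}^\top \mathbf{D} (\mathbf{I} - \gamma\mathbf{P}) \mathbf{X},$$

where $\mu(s)$ is the stationary distribution under $\pi$, $p(s' \mid s)$ is the probability of transition from $s$ 
to $s'$ under policy $\pi$, $\mathbf{P}$ is the $|\mathcal{S}| \times |\mathcal{S}|$ matrix of these probabilities, $\mathbf{D}$ is the $|\mathcal{S}| \times |\mathcal{S}|$ diagonal 
matrix with the $\mu(s)$ on its diagonal, and $\mathbf{X}$ is the $|\mathcal{S}| \times d$ matrix with $\mathbf{x}(s)$ as its rows. From 
here it is clear that the inner matrix $\mathbf{D}(\mathbf{I} - \gamma\mathbf{P})$ is key to determining the positive definiteness 
of $\mathbf{A}$. 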

For a key matrix of this type, positive definiteness is assured if all of its columns sum to a 
nonnegative number. This was shown by Sutton (1988, p. 27) based on two previously established 
theorems. One theorem says that any matrix $\mathbf{M}$ is positive definite if and only if the symmetric 
matrix $\mathbf{S} = \mathbf{M} + \mathbf{M}^\top$ is positive definite (Sutton 1988, appendix). The second theorem says 
that any symmetric real matrix $\mathbf{S}$ is positive definite if all of its diagonal entries are positive and 
greater than the sum of the absolute values of the corresponding off-diagonal entries (Varga 1962, p. 23). For our key 
matrix, $\mathbf{D}(\mathbf{I} - \gamma\mathbf{P})$, the diagonal entries are positive and the off-diagonal entries are negative, so 
all we have to show is that each row sum plus the corresponding column sum is positive. The 
row sums are all positive because $\mathbf{P}$ is a stochastic matrix and $\gamma < 1$. Thus it only remains to 
show that the column sums are nonnegative. Note that the row vector of the column sums of 
any matrix $\mathbf{M}$ can be written as $\mathbf{1}^\top \mathbf{M}$, where $\mathbf{1}$ is the column vector with all components equal 
to 1. Let $\boldsymbol{\mu}$ denote the $|\mathcal{S}|$-vector of the $\mu(s)$, where $\boldsymbol{\mu} = \mathbf{P}^\top \boldsymbol{\mu}$ by virtue of $\mu$ being the stationary 
distribution. The column sums of our key matrix, then, are: 

$$\mathbf{1}^\top \mathbf{D}(\mathbf{I} - \gamma\mathbf{P}) = \boldsymbol{\mu}^\top (\mathbf{I} - \gamma\mathbf{P})$$

$$= \boldsymbol{\mu}^\top - \gamma \boldsymbol{\mu}^\top \mathbf{P}$$

$$= \boldsymbol{\mu}^\top - \gamma \boldsymbol{\mu}^\top \quad \text{(because $\mu$ is the stationary distribution)}$$

$$= (1 - \gamma)\, \boldsymbol{\mu}^\top,$$

all components of which are positive. Thus, the key matrix and its $\mathbf{A}$ matrix are positive definite, 
and on-policy TD(0) is stable. (Additional conditions and a schedule for reducing $\alpha$ over time 
are needed to prove convergence with probability one.) 
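
The quantities used in this argument can be checked numerically on a small case. The sketch below is an illustration under assumptions, not part of the proof: the two-state continuing chain, its rewards, and its single scalar feature are hypothetical, and the stationary distribution is approximated by repeated multiplication. It verifies the column-sum identity $\mathbf{1}^\top \mathbf{D}(\mathbf{I} - \gamma\mathbf{P}) = (1-\gamma)\boldsymbol{\mu}^\top$ and then iterates the expected update (9.10) until it settles at the TD fixed point.

```python
def stationary_distribution(P, sweeps=500):
    """Approximate mu with mu^T = mu^T P by repeated multiplication."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(sweeps):
        mu = [sum(mu[s] * P[s][s2] for s in range(n)) for s2 in range(n)]
    return mu


def key_matrix_column_sums(mu, P, gamma):
    """Column sums of D (I - gamma P), where D is diagonal with mu on the diagonal."""
    n = len(P)
    return [sum(mu[s] * ((1.0 if s == j else 0.0) - gamma * P[s][j]) for s in range(n))
            for j in range(n)]


def td_matrices(mu, P, r, X, gamma):
    """A = sum_s mu(s) x(s) (x(s) - gamma sum_s' P[s][s'] x(s'))^T and b = sum_s mu(s) r(s) x(s)."""
    n, d = len(X), len(X[0])
    A = [[0.0] * d for _ in range(d)]
    b = [0.0] * d
    for s in range(n):
        expected_next = [sum(P[s][s2] * X[s2][j] for s2 in range(n)) for j in range(d)]
        diff = [X[s][j] - gamma * expected_next[j] for j in range(d)]
        for i in range(d):
            b[i] += mu[s] * r[s] * X[s][i]
            for j in range(d):
                A[i][j] += mu[s] * X[s][i] * diff[j]
    return A, b


def iterate_expected_update(A, b, alpha=0.1, steps=5000):
    """Repeat w <- w + alpha * (b - A w), the expected update (9.10)."""
    d = len(b)
    w = [0.0] * d
    for _ in range(steps):
        Aw = [sum(A[i][j] * w[j] for j in range(d)) for i in range(d)]
        w = [w[i] + alpha * (b[i] - Aw[i]) for i in range(d)]
    return w


# Hypothetical two-state continuing chain under a fixed policy.
P = [[0.9, 0.1],
     [0.5, 0.5]]
r = [0.0, 1.0]        # expected reward on leaving each state
X = [[1.0], [2.0]]    # one scalar feature per state, so d = 1
gamma = 0.9

mu = stationary_distribution(P)
col_sums = key_matrix_column_sums(mu, P, gamma)
A, b = td_matrices(mu, P, r, X, gamma)
w_td = iterate_expected_update(A, b)
residual = [b[i] - sum(A[i][j] * w_td[j] for j in range(len(b))) for i in range(len(b))]
```

Iterating the expected update instead of inverting $\mathbf{A}$ keeps the sketch free of a linear-algebra dependency; it relies on $\mathbf{A}$ being positive definite and on $\alpha$ being small enough, as discussed in the box.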


At the TD fixed point, it has also been proven (in the continuing case) that the VE is within a 
bounded expansion of the lowest possible error: 


$$\mathrm{VE}(\mathbf{w}_{TD}) \le \frac{1}{1-\gamma} \min_{\mathbf{w}} \mathrm{VE}(\mathbf{w}). \qquad (9.14)$$


That is, the asymptotic error of the TD method is no more than $\frac{1}{1-\gamma}$ times the smallest possible error, 
that attained in the limit by the Monte Carlo method. Because $\gamma$ is often near one, this expansion 
factor can be quite large, so there is substantial potential loss in asymptotic performance with the TD 
method. On the other hand, recall that the TD methods are often of vastly reduced variance compared 
to Monte Carlo methods, and thus faster, as we saw in Chapters 6 and 7. Which method will be best 
depends on the nature of the approximation and problem, and on how long learning continues. 

A bound analogous to (9.14) applies to other on-policy bootstrapping methods as well. For example, 
linear semi-gradient DP (Eq. 9.7 with $U_t \doteq \sum_a \pi(a \mid S_t) \sum_{s',r} p(s',r \mid S_t,a)[r + \gamma \hat v(s',\mathbf{w}_t)]$) with updates 
according to the on-policy distribution will also converge to the TD fixed point. One-step semi-gradient 
action-value methods, such as semi-gradient Sarsa(0) covered in the next chapter, converge to an 
analogous fixed point and an analogous bound. For episodic tasks, there is a slightly different but related 
bound (see Bertsekas and Tsitsiklis, 1996). There are also a few technical conditions on the rewards, 
features, and decrease in the step-size parameter, which we have omitted here. The full details can be 
found in the original paper (Tsitsiklis and Van Roy, 1997). 

Critical to these convergence results is that states are updated according to the on-policy 
distribution. For other update distributions, bootstrapping methods using function approximation may 
actually diverge to infinity. Examples of this and a discussion of possible solution methods are given in 
Chapter 11. 

Example 9.2: Bootstrapping on the 1000-state Random Walk State aggregation is a special 
case of linear function approximation, so let’s return to the 1000-state random walk to illustrate some of 
the observations made in this chapter. The left panel of Figure 9.2 shows the final value function learned 
by the semi-gradient TD(0) algorithm (page 166) using the same state aggregation as in Example 9.1. 
We see that the near-asymptotic TD approximation is indeed farther from the true values than the 
Monte Carlo approximation shown in Figure 9.1. 

Nevertheless, TD methods retain large potential advantages in learning rate, and generalize Monte 
Carlo methods, as we investigated fully with the multi-step TD methods of Chapter 7. The right panel 
of Figure 9.2 shows results with an n-step semi-gradient TD method using state aggregation and the 
1000-state random walk that are strikingly similar to those we obtained earlier with tabular methods 
and the 19-state random walk (Figure 7.2). To obtain such quantitatively similar results we switched 
the state aggregation to 20 groups of 50 states each. The 20 groups are then quantitatively close to the 
19 states of the tabular problem. In particular, the state transitions of at most 100 states to the right 
or left, or 50 states on average, were quantitatively analogous to the single-state state transitions of the 
tabular system. To complete the match, we use here the same performance measure (an unweighted 
average of the RMS error over all states and over the first 10 episodes) rather than a VE objective as 
is otherwise more appropriate when using function approximation. 




Figure 9.2: Bootstrapping with state aggregation on the 1000-state random walk task. Left: Asymptotic values 
of semi-gradient TD are worse than the asymptotic Monte Carlo values in Figure 9.1. Right: Performance of $n$-step 
methods with state aggregation is strikingly similar to that with tabular representations (cf. Figure 7.2). 

The semi-gradient n-step TD algorithm we used in this example is the natural extension of the 
tabular n-step TD algorithm presented in Chapter 7 to semi-gradient function approximation. The key 
equation, analogous to (7.2), is 

    $\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha \left[ G_{t:t+n} - \hat v(S_t, \mathbf{w}_{t+n-1}) \right] \nabla \hat v(S_t, \mathbf{w}_{t+n-1}), \quad 0 \le t < T,$    (9.15) 

where the n-step return is generalized from (7.1) to 

    $G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} \hat v(S_{t+n}, \mathbf{w}_{t+n-1}), \quad 0 \le t \le T - n.$    (9.16) 

Pseudocode for the complete algorithm is given in the box below. 


n-step semi-gradient TD for estimating $\hat v \approx v_\pi$ 

Input: the policy $\pi$ to be evaluated 
Input: a differentiable function $\hat v : \mathcal{S}^+ \times \mathbb{R}^d \to \mathbb{R}$ such that $\hat v(\text{terminal}, \cdot) = 0$ 
Parameters: step size $\alpha \in (0, 1]$, a positive integer n 
All store and access operations ($S_t$ and $R_t$) can take their index mod n + 1 

Initialize value-function weights $\mathbf{w}$ arbitrarily (e.g., $\mathbf{w} = \mathbf{0}$) 

Repeat (for each episode): 
    Initialize and store $S_0 \ne$ terminal 
    $T \leftarrow \infty$ 
    For t = 0, 1, 2, ... : 
    |   If t < T, then: 
    |       Take an action according to $\pi(\cdot \mid S_t)$ 
    |       Observe and store the next reward as $R_{t+1}$ and the next state as $S_{t+1}$ 
    |       If $S_{t+1}$ is terminal, then $T \leftarrow t + 1$ 
    |   $\tau \leftarrow t - n + 1$    ($\tau$ is the time whose state's estimate is being updated) 
    |   If $\tau \ge 0$: 
    |       $G \leftarrow \sum_{i=\tau+1}^{\min(\tau+n,\,T)} \gamma^{\,i-\tau-1} R_i$ 
    |       If $\tau + n < T$, then: $G \leftarrow G + \gamma^{n} \hat v(S_{\tau+n}, \mathbf{w})$    ($G_{\tau:\tau+n}$) 
    |       $\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ G - \hat v(S_\tau, \mathbf{w}) \right] \nabla \hat v(S_\tau, \mathbf{w})$ 
    Until $\tau = T - 1$ 
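
To make the box concrete, the following is a minimal Python sketch of the algorithm specialized to state aggregation on a random walk like the one in Example 9.2. The function name, the start state in the middle of the chain, the rewards of -1 and +1 at the two ends, and the default sizes are assumptions made for illustration. For simplicity the sketch stores the whole trajectory in lists rather than indexing a circular buffer.

```python
import random

def n_step_semi_gradient_td(n, alpha, gamma=1.0, episodes=10,
                            num_states=1000, group_size=50, max_jump=100, seed=0):
    """n-step semi-gradient TD with state aggregation on a random walk.

    States are 1..num_states. Stepping off the left end gives reward -1,
    off the right end +1 (illustrative assumptions). None marks termination.
    """
    rng = random.Random(seed)
    w = [0.0] * (num_states // group_size)        # one weight per group of states

    def v_hat(s):                                 # value of the terminal state is 0
        return 0.0 if s is None else w[(s - 1) // group_size]

    def step(s):                                  # the fixed policy is the random walk
        s2 = s + rng.randint(1, max_jump) * rng.choice((-1, 1))
        if s2 < 1:
            return -1.0, None
        if s2 > num_states:
            return 1.0, None
        return 0.0, s2

    for _ in range(episodes):
        states = [num_states // 2]                # S_0
        rewards = [0.0]                           # padding so that rewards[i] is R_i
        T = float("inf")
        t = 0
        while True:
            if t < T:
                r, s_next = step(states[t])
                rewards.append(r)
                states.append(s_next)
                if s_next is None:
                    T = t + 1
            tau = t - n + 1                       # time whose estimate is updated
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * v_hat(states[tau + n])
                g = (states[tau] - 1) // group_size
                # With state aggregation the gradient is a one-hot vector,
                # so only the weight of the group containing S_tau changes.
                w[g] += alpha * (G - w[g])
            if tau == T - 1:
                break
            t += 1
    return w
```

Calling `n_step_semi_gradient_td(n=4, alpha=0.3)` returns the 20 group weights after 10 episodes. The particular values depend on the random seed.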


9.5 Feature Construction for Linear Methods 

Linear methods are interesting because of their convergence guarantees, but also because in practice 
they can be very efficient in terms of both data and computation. Whether or not this is so depends 
critically on how the states are represented in terms of features, which we investigate in this large section. 
Choosing features appropriate to the task is an important way of adding prior domain knowledge to 
reinforcement learning systems. Intuitively, the features should correspond to the aspects of the state 
space along which generalization may be appropriate. If we are valuing geometric objects, for example, 
we might want to have features for each possible shape, color, size, or function. If we are valuing states 
of a mobile robot, then we might want to have features for locations, degrees of remaining battery 
power, recent sonar readings, and so on. 

A limitation of the linear form is that it cannot take into account any interactions between features, 
such as the presence of feature i being good only in the absence of feature j. For example, in the 
pole-balancing task (Example 3.4), a high angular velocity can be either good or bad depending on 
the angle. If the angle is high, then high angular velocity means an imminent danger of falling (a bad 
state), whereas if the angle is low, then high angular velocity means the pole is righting itself (a good 
state. A linear value function could not represent this if its features coded separately for the angle and 
the angular velocity. It needs instead, or in addition, features for combinations of these two underlying 
state dimensions. In the following subsections we consider a variety of general ways of doing this. 


9.5.1 Polynomials 

The states of many problems are initially expressed as numbers, such as positions and velocities in 
the pole-balancing task (Example 3.4), the number of cars in each lot in the Jack’s car rental problem 
(Example 4.2), or the gambler’s capital in the gambler problem (Example 4.3). In these types of 
problems, function approximation for reinforcement learning has much in common with the familiar 
tasks of interpolation and regression. Various families of features commonly used for interpolation 
and regression can also be used in reinforcement learning. Polynomials make up one of the simplest 
families of features used for interpolation and regression. While the basic polynomial features 
we discuss here do not work as well as other types of features in reinforcement learning, they serve as 
a good introduction because they are simple and familiar. 

As an example, suppose a reinforcement learning problem has states with two numerical dimensions. 
For a single representative state s, let its two numbers be $s_1 \in \mathbb{R}$ and $s_2 \in \mathbb{R}$. You might choose to 
represent s simply by its two state dimensions, so that $\mathbf{x}(s) = (s_1, s_2)^\top$, but then you would not be 
able to take into account any interactions between these dimensions. In addition, if both $s_1$ and $s_2$ 
were zero, then the approximate value would have to also be zero. Both limitations can be overcome 
by instead representing s by the four-dimensional feature vector $\mathbf{x}(s) = (1, s_1, s_2, s_1 s_2)^\top$. The initial 
1 feature allows the representation of affine functions in the original state numbers, and the final 
product feature, $s_1 s_2$, enables interactions to be taken into account. Or you might choose to use 
higher-dimensional feature vectors like 
$\mathbf{x}(s) = (1, s_1, s_2, s_1 s_2, s_1^2, s_2^2, s_1 s_2^2, s_1^2 s_2, s_1^2 s_2^2)^\top$ to take more complex 
interactions into account. Such feature vectors enable approximations as arbitrary quadratic functions 
of the state numbers, even though the approximation is still linear in the weights that have to be 
learned. Generalizing this example from two to k numbers, we can represent highly complex interactions 
among a problem's state dimensions: 

Suppose each state s corresponds to k numbers, $s_1, s_2, \ldots, s_k$, with each $s_i \in \mathbb{R}$. For this 
k-dimensional state space, each order-n polynomial-basis feature $x_i$ can be written as 

    $x_i(s) = \prod_{j=1}^{k} s_j^{c_{i,j}},$    (9.17) 

where each $c_{i,j}$ is an integer in the set $\{0, 1, \ldots, n\}$ for an integer $n \ge 0$. These features make 
up the order-n polynomial basis for dimension k, which contains $(n+1)^k$ different features. 
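
As a concrete reading of (9.17), here is a small Python sketch that enumerates every exponent vector with entries in {0, ..., n} and multiplies out the corresponding feature. The function name `polynomial_features` and the ordering of the features are our own illustrative choices, not something fixed by the definition.

```python
from itertools import product

def polynomial_features(s, n):
    """Order-n polynomial basis of (9.17) for a state s given as k numbers."""
    k = len(s)
    feats = []
    for c in product(range(n + 1), repeat=k):   # one exponent vector per feature
        x = 1.0
        for s_j, c_ij in zip(s, c):
            x *= s_j ** c_ij                    # contributes s_j raised to c_ij
        feats.append(x)
    return feats
```

For k = 2 and n = 1 this yields the four features 1, $s_2$, $s_1$, and $s_1 s_2$, which are the same features as in the four-dimensional example above, in a different order.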


Higher-order polynomial bases allow for more accurate approximations of more complicated functions. 
But because the number of features in an order-n polynomial basis grows exponentially with the 
dimension k of the natural state space (if n > 0), it is generally necessary to select a subset of them 
for function approximation. This can be done using prior beliefs about the nature of the function to 
be approximated, and some automated selection methods developed for polynomial regression can be 
adapted to deal with the incremental and nonstationary nature of reinforcement learning. 

Exercise 9.1 Why does (9.17) define $(n+1)^k$ distinct features for dimension k? □ 

Exercise 9.2 What n and $c_{i,j}$ produce the feature vectors 
$\mathbf{x}(s) = (1, s_1, s_2, s_1 s_2, s_1^2, s_2^2, s_1 s_2^2, s_1^2 s_2, s_1^2 s_2^2)^\top$? □ 


9.5.2 Fourier Basis 

Another linear function approximation method is based on the time-honored Fourier series, which 
expresses periodic functions as weighted sums of sine and cosine basis functions (features) of different 
frequencies. (A function f is periodic if f(x) = f(x + r) for all x and some period r.) The Fourier 
series and the more general Fourier transform are widely used in applied sciences in part because if a 
function to be approximated is known, then the basis function weights are given by simple formulae 
and, further, with enough basis functions essentially any function can be approximated as accurately as 
desired. In reinforcement learning, where the functions to be approximated are unknown, Fourier basis 
functions are of interest because they are easy to use and can perform well in a range of reinforcement 
learning problems. 

First consider the one-dimensional case. The usual Fourier series representation of a function of one 
dimension having period r represents the function as a linear combination of sine and cosine functions 
that are each periodic with periods that evenly divide r (in other words, whose frequencies are integer 
multiples of a fundamental frequency 1/r). But if you are interested in approximating an aperiodic 
function defined over a bounded interval, then you can use these Fourier basis features with r set to the 
length of the interval. The function of interest is then just one period of the periodic linear combination 
of the sine and cosine features. 

Furthermore, if you set r to twice the length of the interval of interest and restrict attention to the 
approximation over the half interval [0, r/2], then you can use just the cosine features. This is possible 
because you can represent any even function, that is, any function that is symmetric about the origin, 
with just the cosine basis. So any function over the half-period [0, r/2] can be approximated as closely as 
desired with enough cosine features. (Saying “any function” is not exactly correct because the function 
has to be mathematically well-behaved, but we skip this technicality here.) Alternatively, it is possible 
to use just sine features, linear combinations of which are always odd functions, that is functions that 
are anti-symmetric about the origin. But it is generally better to keep just the cosine features because 
“half-even” functions tend to be easier to approximate than “half-odd” functions since the latter are 
often discontinuous at the origin. Of course, this does not rule out using both sine and cosine features 
to approximate over the interval [0, r/2], which might be advantageous in some circumstances. 

Following this logic and letting r = 2 so that the features are defined over the half-r interval [0,1], 
the one-dimensional order-n Fourier cosine basis consists of the n + 1 features 

    $x_i(s) = \cos(i \pi s), \quad s \in [0, 1],$ 

for i = 0, ..., n. Figure 9.3 shows one-dimensional Fourier cosine features $x_i$, for i = 1, 2, 3, 4; $x_0$ is a 
constant function. 



Figure 9.3: One-dimensional Fourier cosine-basis features $x_i$, i = 1, 2, 3, 4, for approximating functions over 
the interval [0,1]. After Konidaris et al. (2011). 

This same reasoning applies to the Fourier cosine series approximation in the multi-dimensional case 
as described in the box below. 

Suppose each state s corresponds to a vector of k numbers, $s = (s_1, s_2, \ldots, s_k)^\top$, with each 
$s_i \in [0, 1]$. The ith feature in the order-n Fourier cosine basis can then be written 

    $x_i(s) = \cos\left( \pi\, s^\top c^i \right),$    (9.18) 

where $c^i = (c^i_1, \ldots, c^i_k)^\top$, with $c^i_j \in \{0, \ldots, n\}$ for $j = 1, \ldots, k$ and $i = 1, \ldots, (n+1)^k$. This 
defines a feature for each of the $(n+1)^k$ possible integer vectors $c^i$. The inner product $s^\top c^i$ has 
the effect of assigning an integer in $\{0, \ldots, n\}$ to each dimension of s. As in the one-dimensional 
case, this integer determines the feature's frequency along that dimension. The features can of 
course be shifted and scaled to suit the bounded state space of a particular application. 
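
The definition in (9.18) can be sketched in a few lines of Python. The function name `fourier_cosine_features` is hypothetical, and the sketch assumes the state has already been scaled so that every component lies in [0, 1]. It returns the integer vectors alongside the feature values so that they can be inspected or used for feature selection.

```python
import math
from itertools import product

def fourier_cosine_features(s, n):
    """Order-n Fourier cosine basis of (9.18); each component of s must lie in [0, 1]."""
    k = len(s)
    coeffs = list(product(range(n + 1), repeat=k))      # all (n+1)^k integer vectors
    feats = [math.cos(math.pi * sum(s_j * c_j for s_j, c_j in zip(s, c)))
             for c in coeffs]
    return coeffs, feats
```

With k = 1 this reduces to the one-dimensional features cos(iπs) for i = 0, ..., n.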


As an example, consider the k = 2 case in which $s = (s_1, s_2)^\top$, where each $c^i = (c^i_1, c^i_2)^\top$. Figure 9.4 
shows a selection of six Fourier cosine features, each labeled by the vector $c^i$ that defines it ($s_1$ is the 
horizontal axis and $c^i$ is shown as a row vector with the index i omitted). Any zero in c means the 
feature is constant along that state dimension. So if $c = (0, 0)^\top$, the feature is constant over both 
dimensions; if $c = (c_1, 0)^\top$ the feature is constant over the second dimension and varies over the first 
with frequency depending on $c_1$; and similarly for $c = (0, c_2)^\top$. When $c = (c_1, c_2)^\top$ with neither 
$c_j = 0$, the feature varies along both dimensions and represents an interaction between the two state 
variables. The values of $c_1$ and $c_2$ determine the frequency along each dimension, and their ratio gives 
the direction of the interaction. 

When using Fourier cosine features with a learning algorithm such as (9.7), semi-gradient TD(0), 
or semi-gradient Sarsa, it may be helpful to use a different step-size parameter for each feature. If 
$\alpha$ is the basic step-size parameter, then Konidaris, Osentoski, and Thomas (2011) suggest setting the 
step-size parameter for feature $x_i$ to $\alpha_i = \alpha / \sqrt{(c^i_1)^2 + \cdots + (c^i_k)^2}$ (except when each $c^i_j = 0$, in which 
case $\alpha_i = \alpha$). 

Figure 9.4: A selection of six two-dimensional Fourier cosine features, each labeled by the vector $c^i$ that defines 
it ($s_1$ is the horizontal axis, and $c^i$ is shown with the index i omitted). After Konidaris et al. (2011). 
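
The per-feature step-size rule just described, in which the basic step size is divided by the Euclidean norm of the feature's integer vector and left unchanged for the all-zero vector, might be sketched as follows. The function name and the representation of the integer vectors as tuples are assumptions for illustration.

```python
import math

def fourier_step_sizes(coeffs, alpha):
    """Per-feature step sizes: alpha / ||c||_2, or alpha itself when c is all zeros."""
    sizes = []
    for c in coeffs:
        norm = math.sqrt(sum(c_j * c_j for c_j in c))
        sizes.append(alpha if norm == 0 else alpha / norm)
    return sizes
```

The effect is that higher-frequency features, whose integer vectors have larger norms, are updated more cautiously than low-frequency ones.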

Fourier cosine features with Sarsa can produce good performance compared to several other collections 
of basis functions, including polynomial and radial basis functions. Not surprisingly, however, Fourier 
features have trouble with discontinuities because it is difficult to avoid “ringing” around points of 
discontinuity unless very high frequency basis functions are included. 

The number of features in the order-n Fourier basis grows exponentially with the dimension of the 
state space, but if that dimension is small enough (e.g., $k \le 5$), one can select n so that all of the order-n 
Fourier features can be used. This makes the selection of features more-or-less automatic. For high 
dimension state spaces, however, it is necessary to select a subset of these features. This can be done 
using prior beliefs about the nature of the function to be approximated, and some automated selection 
methods can be adapted to deal with the incremental and nonstationary nature of reinforcement 
learning. An advantage of Fourier basis features in this regard is that it is easy to select features by setting 
the $c^i$ vectors to account for suspected interactions among the state variables, and by limiting the values 
in the $c^i$ vectors so that the approximation can filter out high frequency components considered to be 
noise. On the other hand, because Fourier features are non-zero over the entire state space (with the 
few zeros excepted), they represent global properties of states, which can make it difficult to find good 
ways to represent local properties. 

Figure 9.5 shows learning curves comparing the Fourier and polynomial bases on the 1000-state 
random walk example. In general, we do not recommend using polynomials for online learning. 2 


[Figure 9.5 plot: vertical axis $\sqrt{\overline{\text{VE}}}$; horizontal axis Episodes.] 

Figure 9.5: Fourier basis vs polynomials on the 1000-state random walk. Shown are learning curves for the 
gradient Monte Carlo method with Fourier and polynomial bases of order 5, 10, and 20. The step-size parameters 
were roughly optimized for each case: $\alpha = 0.0001$ for the polynomial basis and $\alpha = 0.00005$ for the Fourier 
basis. 


Exercise 9.3 Why does (9.18) define $(n+1)^k$ distinct features? □ 


9.5.3 Coarse Coding 

Consider a task in which the natural representation of the state set is a continuous two-dimensional 
space. One kind of representation for this case is made up of features corresponding to circles in state 
space, as shown in Figure 9.6. If the state is inside a circle, then the corresponding feature has the value 
1 and is said to be present; otherwise the feature is 0 and is said to be absent. This kind of 1-0-valued 
feature is called a binary feature. Given a state, which binary features are present indicate within which 
circles the state lies, and thus coarsely code for its location. Representing a state with features that 
overlap in this way (although they need not be circles or binary) is known as coarse coding. 

Figure 9.6: Coarse coding. Generalization from state s to state s' depends on the number of their features 
whose receptive fields (in this case, circles) overlap. These states have one feature in common, so there will be 
slight generalization between them. 

2 There are families of polynomials more complicated than those we have discussed, for example, different families of 
orthogonal polynomials, and these might work better, but at present there is little experience with them in reinforcement 
learning. 

Assuming linear gradient-descent function approximation, consider the effect of the size and density 
of the circles. Corresponding to each circle is a single weight (a component of w) that is affected by 
learning. If we train at one state, a point in the space, then the weights of all circles intersecting that 
state will be affected. Thus, by (9.8), the approximate value function will be affected at all states within 
the union of the circles, with a greater effect the more circles a point has “in common” with the state, 
as shown in Figure 9.6. If the circles are small, then the generalization will be over a short distance, as 
in Figure 9.7 (left), whereas if they are large, it will be over a large distance, as in Figure 9.7 (middle). 
Moreover, the shape of the features will determine the nature of the generalization. For example, if 
they are not strictly circular, but are elongated in one direction, then generalization will be similarly 
affected, as in Figure 9.7 (right). 
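
A minimal Python sketch of this idea follows, assuming circular receptive fields of a common radius. The function names, the centers, and the radius are illustrative. Counting the circles that contain both of two states gives a rough measure of how strongly an update at one state carries over to the other under linear function approximation.

```python
import math

def circle_features(s, centers, radius):
    """Binary coarse-coding features: 1 if state s lies inside the circle, else 0."""
    return [1 if math.dist(s, c) <= radius else 0 for c in centers]

def shared_features(s1, s2, centers, radius):
    """Number of circles containing both states (more overlap, more generalization)."""
    x1 = circle_features(s1, centers, radius)
    x2 = circle_features(s2, centers, radius)
    return sum(a * b for a, b in zip(x1, x2))
```

For example, with centers at (0, 0), (1, 0), and (2, 0) and radius 0.75, the states (0.5, 0) and (1.5, 0) share only the middle circle, so training at one would change the value of the other only slightly.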



[Figure 9.7 panels: Narrow generalization (left), Broad generalization (middle), Asymmetric generalization (right).] 

Figure 9.7: Generalization in linear function approximation methods is determined by the sizes and shapes of 
the features' receptive fields. All three of these cases have roughly the same number and density of features. 

Features with large receptive fields give broad generalization, but might also seem to limit the learned 
function to a coarse approximation, unable to make discriminations much finer than the width of the 
receptive fields. Happily, this is not the case. Initial generalization from one point to another is indeed 
controlled by the size and shape of the receptive fields, but acuity, the finest discrimination ultimately 
possible, is controlled more by the total number of features. 

Example 9.3: Coarseness of Coarse Coding This example illustrates the effect on learning of the 
size of the receptive fields in coarse coding. Linear function approximation based on coarse coding and 
(9.7) was used to learn a one-dimensional square-wave function (shown at the top of Figure 9.8). The 
values of this function were used as the targets, $U_t$. With just one dimension, the receptive fields were 
intervals rather than circles. Learning was repeated with three different sizes of the intervals: narrow, 
medium, and broad, as shown at the bottom of the figure. All three cases had the same density of 
features, about 50 over the extent of the function being learned. Training examples were generated 
uniformly at random over this extent. The step-size parameter was $\alpha = 0.2/n$, where n is the number of 
features that were present at one time. Figure 9.8 shows the functions learned in all three cases over 
the course of learning. Note that the width of the features had a strong effect early in learning. With 
broad features, the generalization tended to be broad; with narrow features, only the close neighbors of 
each trained point were changed, causing the function learned to be more bumpy. However, the final 
function learned was affected only slightly by the width of the features. Receptive field shape tends to 
have a strong effect on generalization but little effect on asymptotic solution quality. 
Figure 9.8: Example of feature width's strong effect on initial generalization (first row) and weak effect on 
asymptotic accuracy (last row). ■ 

[Figure 9.8 plot: the desired function and the learned approximation, shown in three columns for narrow, 
medium, and broad features, with the feature width drawn beneath each column.] 

9.5.4 Tile Coding 

Tile coding is a form of coarse coding for multi-dimensional continuous spaces that is flexible and 
computationally efficient. It may be the most practical feature representation for modern sequential 
digital computers. Open-source software is available for many kinds of tile coding. 

In tile coding the receptive fields of the features are grouped into partitions of the state space. Each 
such partition is called a tiling , and each element of the partition is called a tile. For example, the 
simplest tiling of a two-dimensional state space is a uniform grid such as that shown on the left side of 
Figure 9.9. The tiles or receptive fields here are squares rather than the circles in Figure 9.6. If just this 
single tiling were used, then the state indicated by the white spot would be represented by the single 
feature whose tile it falls within; generalization would be complete to all states within the same tile and 
nonexistent to states outside it. With just one tiling, we would not have coarse coding but just a case 
of state aggregation. 

[Figure 9.9 annotation: Four active tiles/features overlap the point and are used to represent it.] 

Figure 9.9: Multiple, overlapping grid-tilings on a limited two-dimensional space. These tilings are offset from 
one another by a uniform amount in each dimension. 


To get the strengths of coarse coding requires overlapping receptive fields, and by definition the tiles 
of a partition do not overlap. To get true coarse coding with tile coding, multiple tilings are used, 
each offset by a fraction of a tile width. A simple case with four tilings is shown on the right side of 
Figure 9.9. Every state, such as that indicated by the white spot, falls in exactly one tile in each of 
the four tilings. These four tiles correspond to four features that become active when the state occurs. 
Specifically, the feature vector x(s) has one component for each tile in each tiling. In this example there 
are 4 × 4 × 4 = 64 components, all of which will be 0 except for the four corresponding to the tiles that 
s falls within. Figure 9.10 shows the advantage of multiple offset tilings (coarse coding) over a single 
tiling on the 1000-state random walk example. 
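
The following Python sketch computes which tile is active in each of several uniformly offset grid tilings. It is only one possible layout, not the scheme of any particular tile-coding package: the function name and parameters are hypothetical, the state is assumed to lie in [low, high] in every dimension, and each tiling is given one extra row of tiles per dimension so that the shifted grids still cover the whole space.

```python
def active_tiles(s, num_tilings, tiles_per_dim, low=0.0, high=1.0):
    """Return the index of the single active tile in each offset grid tiling."""
    k = len(s)
    width = (high - low) / tiles_per_dim
    side = tiles_per_dim + 1                    # extra tile so shifted grids cover the space
    tiles_per_tiling = side ** k
    indices = []
    for m in range(num_tilings):
        offset = m * width / num_tilings        # uniform offset, the same in each dimension
        index = 0
        for s_j in s:
            coord = int((s_j - low + offset) / width)
            coord = min(coord, side - 1)        # guard against floating-point overshoot
            index = index * side + coord        # row-major index within this tiling
        indices.append(m * tiles_per_tiling + index)
    return indices
```

The returned list always has exactly `num_tilings` entries, one per tiling, and nearby states share most of their entries while distant states share none.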

An immediate practical advantage of tile coding is that, because it works with partitions, the overall 
number of features that are active at one time is the same for any state. Exactly one feature is present 
in each tiling, so the total number of features present is always the same as the number of tilings. This 
allows the step-size parameter, $\alpha$, to be set in an easy, intuitive way. For example, choosing $\alpha = 1/n$, 
where n is the number of tilings, results in exact one-trial learning: each of the n active weights moves 
by $\alpha$ times the error, so the estimate moves by $n\alpha$ times the error. If the example $s \mapsto v$ is trained 
on, then whatever the prior estimate, $\hat v(s, \mathbf{w}_t)$, the new estimate will be $\hat v(s, \mathbf{w}_{t+1}) = v$. Usually one 
wishes to change more slowly than this, to allow for generalization and stochastic variation in target 
outputs. For example, one might choose $\alpha = 1/(10n)$, in which case the estimate for the trained state 
would move one-tenth of the way to the target in one update, and neighboring states will be moved 
less, proportional to the number of tiles they have in common. 

[Figure 9.10 plot: horizontal axis Episodes.] 

Figure 9.10: Why we use coarse coding. Shown are learning curves on the 1000-state random walk example 
for the gradient Monte Carlo algorithm with a single tiling and with multiple tilings. The space of 1000 states 
was treated as a single continuous dimension, covered with tiles each 200 states wide. The multiple tilings were 
offset from each other by 4 states. The step-size parameter was set so that the initial learning rate in the two 
cases was the same, $\alpha = 0.0001$ for the single tiling and $\alpha = 0.0001/50$ for the 50 tilings. 

Tile coding also gains computational advantages from its use of binary feature vectors. Because each 
component is either 0 or 1, the weighted sum making up the approximate value function (9.8) is almost 
trivial to compute. Rather than performing d multiplications and additions, one simply computes the 
indices of the $n \ll d$ active features and then adds up the n corresponding components of the weight 
vector. 
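
With binary features, both the value computation and the linear update touch only the active components. The sketch below assumes the active features are given as a list of indices into the weight vector. The function names are our own.

```python
def value(w, active):
    """Approximate value: the sum of the weights of the active binary features."""
    return sum(w[i] for i in active)

def update(w, active, target, alpha):
    """One linear gradient step toward target; only active components of w change."""
    error = target - value(w, active)
    for i in active:
        w[i] += alpha * error
```

With n active features, a single call with `alpha = 1 / n` moves the estimate for the trained state all the way to the target, and `alpha = 1 / (10 * n)` moves it one-tenth of the way. A state that shares only some of the active tiles moves proportionally less.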

Generalization occurs to states other than the one trained if those states fall within any of the same 
tiles, proportional to the number of tiles in common. Even the choice of how to offset the tilings 
from each other affects generalization. If they are offset uniformly in each dimension, as they were in 
Figure 9.9, then different states can generalize in qualitatively different ways, as shown below in the 
upper half of Figure 9.11. Each of the eight subfigures shows the pattern of generalization from a trained 
state to nearby points. In this example there are eight tilings, thus 64 subregions within a tile that 
generalize distinctly, but all according to one of these eight patterns. Note how uniform offsets result 
in a strong effect along the diagonal in many patterns. These artifacts can be avoided if the tilings are 
offset asymmetrically, as shown in the lower half of the figure. These lower generalization patterns are 
better because they are all well centered on the trained state with no obvious asymmetries. 

Figure 9.11: Why asymmetrical offsets are preferred in tile coding. Shown is the strength of generalization 
from a trained state, indicated by the small black plus, to nearby states, for the case of eight tilings. If the tilings 
are uniformly offset (above), then there are diagonal artifacts and substantial variations in the generalization, 
whereas with asymmetrically offset tilings the generalization is more spherical and homogeneous. 

Tilings in all cases are offset from each other by a fraction of a tile width in each dimension. If w 
denotes the tile width and n the number of tilings, then $w/n$ is a fundamental unit. Within small squares 
$w/n$ on a side, all states activate the same tiles, have the same feature representation, and the same 
approximated value. If a state is moved by $w/n$ in any cartesian direction, the feature representation 
changes by one component/tile. Uniformly offset tilings are offset from each other by exactly this unit 
distance. For a two-dimensional space, we say that each tiling is offset by the displacement vector 
(1,1), meaning that it is offset from the previous tiling by $w/n$ times this vector. In these terms, the 
asymmetrically offset tilings shown in the lower part of Figure 9.11 are offset by a displacement vector 
of (1,3). 

Extensive studies have been made of the effect of different displacement vectors on the generalization 
of tile coding (Parks and Militzer, 1991; An, 1991; An, Miller and Parks, 1991; Miller, Glanz and Carter, 
1991), assessing their homogeneity and tendency toward diagonal artifacts like those seen for the (1,1) 
displacement vectors. Based on this work, Miller and Glanz (1996) recommend using displacement 
vectors consisting of the first odd integers. In particular, for a continuous space of dimension k, a good 
choice is to use the first odd integers $(1, 3, 5, 7, \ldots, 2k - 1)$, with n (the number of tilings) set to an 
integer power of 2 greater than or equal to 4k. This is what we have done to produce the tilings in 
the lower half of Figure 9.11, in which k = 2, $n = 2^3 \ge 4k$, and the displacement vector is (1,3). In 
a three-dimensional case, the first four tilings would be offset in total from a base position by (0, 0, 0), 
(1, 3, 5), (2, 6, 10), and (3, 9, 15). Open-source software that can efficiently make tilings like this for any 
k is readily available. 
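
The recommendation above is easy to express in code. The sketch below, with hypothetical function names, computes the total offset of each tiling in units of the tile width divided by the number of tilings, using the displacement vector of the first k odd integers, and picks the smallest power of 2 that is at least 4k as the number of tilings.

```python
def tiling_offsets(k, num_tilings):
    """Total offset of each tiling, in units of (tile width / number of tilings)."""
    displacement = [2 * j + 1 for j in range(k)]        # (1, 3, 5, ..., 2k - 1)
    return [tuple(m * d for d in displacement) for m in range(num_tilings)]

def recommended_num_tilings(k):
    """Smallest integer power of 2 that is greater than or equal to 4k."""
    n = 1
    while n < 4 * k:
        n *= 2
    return n
```

For k = 3, `tiling_offsets(3, 4)` reproduces the offsets (0, 0, 0), (1, 3, 5), (2, 6, 10), and (3, 9, 15) given in the text. Because a grid tiling repeats every tile width, an implementation may reduce each offset modulo the number of tilings without changing the tiling.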

In choosing a tiling strategy, one has to pick the number of the tilings and the shape of the tiles. The 
number of tilings, along with the size of the tiles, determines the resolution or fineness of the asymptotic 
approximation, as in general coarse coding and illustrated in Figure 9.8. The shape of the tiles will 
determine the nature of generalization as in Figure 9.7. Square tiles will generalize roughly equally 
in each dimension as indicated in Figure 9.11 (lower). Tiles that are elongated along one dimension, 
such as the stripe tilings in Figure 9.12 (middle), will promote generalization along that dimension. 
The tilings in Figure 9.12 (middle) are also denser and thinner on the left, promoting discrimination 
along the horizontal dimension at lower values along that dimension. The diagonal stripe tiling in 
Figure 9.12 (right) will promote generalization along one diagonal. In higher dimensions, axis-aligned 
stripes correspond to ignoring some of the dimensions in some of the tilings, that is, to hyperplanar 
slices. Irregular tilings such as shown in Figure 9.12 (left) are also possible, though rare in practice and 
beyond the standard software. 

In practice, it is often desirable to use different shaped tiles in different tilings. For example, one 
might use some vertical stripe tilings and some horizontal stripe tilings. This would encourage 
generalization along either dimension. However, with stripe tilings alone it is not possible to learn that a 
particular conjunction of horizontal and vertical coordinates has a distinctive value (whatever is learned 
for it will bleed into states with the same horizontal and vertical coordinates). For this one needs the 
conjunctive rectangular tiles such as originally shown in Figure 9.9. With multiple tilings (some 
horizontal, some vertical, and some conjunctive) one can get everything: a preference for generalizing along 
each dimension, yet the ability to learn specific values for conjunctions (see Sutton, 1996 for examples). 
The choice of tilings determines generalization, and until this choice can be effectively automated, it 
is important that tile coding enables the choice to be made flexibly and in a way that makes sense to 
people. 

Figure 9.12: Tilings need not be grids. They can be arbitrarily shaped and non-uniform, while still in many 
cases being computationally efficient to compute. 

Another useful trick for reducing memory requirements is hashing, a consistent pseudo-random 
collapsing of a large tiling into a much smaller set of tiles. Hashing produces tiles consisting of 
noncontiguous, disjoint regions randomly spread throughout the state space, but that still form an 
exhaustive partition. For example, one tile might consist of the four subtiles shown to 
the right. Through hashing, memory requirements are often reduced by large 
factors with little loss of performance. This is possible because high resolution 
is needed in only a small fraction of the state space. Hashing frees us from the 
curse of dimensionality in the sense that memory requirements need not be exponential in the number 
of dimensions, but need merely match the real demands of the task. Good open-source implementations 
of tile coding, including hashing, are widely available. 
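
As a rough illustration of hashing, the sketch below maps a tiling number and a tile's integer grid coordinates to one of a fixed number of weights. Python's built-in `hash` on a tuple of integers is used purely as a stand-in for a consistent pseudo-random function. It is an assumption of this sketch, not the hash used by any particular tile-coding software.

```python
def hashed_index(tiling, coords, table_size):
    """Map (tiling number, integer tile coordinates) to one of table_size weights."""
    key = (tiling,) + tuple(coords)
    return hash(key) % table_size       # the same tile always maps to the same index
```

Distinct tiles may collide and so share a weight, which is the price paid for a weight vector whose length is chosen freely rather than growing exponentially with the number of dimensions.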

Exercise 9.4 Suppose we believe that one of two state dimensions is more likely to have an effect 
on the value function than is the other, that generalization should be primarily across this dimension 
rather than along it. What kind of tilings could be used to take advantage of this prior knowledge? □ 



9.5.5 Radial Basis Functions 


Radial basis functions (RBFs) are the natural generalization of coarse coding to continuous-valued 
features. Rather than each feature being either 0 or 1, it can be anything in the interval [0,1], reflecting 
various degrees to which the feature is present. A typical RBF feature, $x_i$, has a Gaussian (bell-shaped) 
response $x_i(s)$ dependent only on the distance between the state, s, and the feature's prototypical or 
center state, $c_i$, and relative to the feature's width, $\sigma_i$: 

    $x_i(s) \doteq \exp\left( -\frac{\| s - c_i \|^2}{2 \sigma_i^2} \right).$ 


The norm or distance metric of course can be chosen in whatever way seems most appropriate to the 
states and task at hand. Figure 9.13 shows a one-dimensional example with a Euclidean distance metric. 
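
A Gaussian RBF feature vector of this kind, in which feature i responds as exp(-||s - c_i||² / (2σ_i²)), can be sketched in Python as follows. The Euclidean norm is assumed, and the function name and argument layout are illustrative choices.

```python
import math

def rbf_features(s, centers, sigmas):
    """Gaussian RBF features: exp(-||s - c_i||^2 / (2 sigma_i^2)), Euclidean norm."""
    feats = []
    for c, sigma in zip(centers, sigmas):
        sq_dist = sum((s_j - c_j) ** 2 for s_j, c_j in zip(s, c))
        feats.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
    return feats
```

Each feature equals 1 when the state coincides with its center and decays smoothly toward 0 with distance, at a rate set by the width.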

The primary advantage of RBFs over binary features is that they produce approximate functions 
that vary smoothly and are differentiable. Although this is appealing, in most cases it has no practical 
significance. Nevertheless, extensive studies have been made of graded response functions such as RBFs 
in the context of tile coding (An, 1991; Miller et al., 1991; An, Miller and Parks, 1991; Lane, Handelman 
and Gelfand, 1992). All of these methods require substantial additional computational complexity 
(over tile coding) and often reduce performance when there are more than two state dimensions. In 
high dimensions the edges of tiles are much more important, and it has proven difficult to obtain well 
controlled graded tile activations near the edges. 

Figure 9.13: One-dimensional radial basis functions. 

An RBF network is a linear function approximator using RBFs for its features. Learning is defined by 
equations (9.7) and (9.8), exactly as in other linear function approximators. In addition, some learning 
methods for RBF networks change the centers and widths of the features as well, bringing them into the 
realm of nonlinear function approximators. Nonlinear methods may be able to fit target functions much 
more precisely. The downside to RBF networks, and to nonlinear RBF networks especially, is greater 
computational complexity and, often, more manual tuning before learning is robust and efficient. 


9.6 Nonlinear Function Approximation: Artificial Neural Networks 

Artificial neural networks (ANNs) are widely used for nonlinear function approximation. An ANN 
is a network of interconnected units that have some of the properties of neurons, the main components 
of nervous systems. ANNs have a long history, with the latest advances in training deeply-layered 
ANNs being responsible for some of the most impressive abilities of machine learning systems, including 
reinforcement learning systems. In Chapter 16 we describe several impressive examples of reinforcement 
learning systems that use ANN function approximation. 

Figure 9.14 shows a generic feedforward ANN, meaning that there are no loops in the network, 
that is, there are no paths within the network by which a unit’s output can influence its input. The 
network in the figure has an output layer consisting of two output units, an input layer with four input 
units, and two hidden layers: layers that are neither input nor output layers. A real-valued weight is 
associated with each link. A weight roughly corresponds to the efficacy of a synaptic connection in 
a real neural network (see Section 15.1). If an ANN has at least one loop in its connections, it is a 
recurrent rather than a feedforward ANN. Although both feedforward and recurrent ANNs have been 
used in reinforcement learning, here we look only at the simpler feedforward case. 

The units (the circles in Figure 9.14) are typically semi-linear units, meaning that they compute 
a weighted sum of their input signals and then apply to the result a nonlinear function, called the 
activation function, to produce the unit’s output, or activation. Different activation functions are used, 
but they are typically S-shaped, or sigmoid, functions such as the logistic function $f(x) = 1/(1 + e^{-x})$, 
though sometimes the rectifier nonlinearity $f(x) = \max(0, x)$ is used. A step function like $f(x) = 1$ if 
$x \geq \theta$, and 0 otherwise, results in a binary unit with threshold $\theta$. It is often useful for units in different 
layers to use different activation functions. 

Figure 9.14: A generic feedforward neural network with four input units, two output units, and two hidden 
layers. 
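
As a small illustration of a semi-linear unit, the sketch below implements the three activation functions mentioned above and applies one of them to a weighted sum of inputs. The function names, the example inputs and weights, and the choice to treat the step threshold as inclusive are assumptions made for the example.

```python
import math

def logistic(x):
    """Logistic sigmoid f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def rectifier(x):
    """Rectifier nonlinearity f(x) = max(0, x)."""
    return max(0.0, x)

def step(x, theta):
    """Binary threshold unit: 1 if x reaches the threshold theta, else 0."""
    return 1.0 if x >= theta else 0.0

def semi_linear_unit(inputs, weights, activation):
    """Weighted sum of the input signals followed by a nonlinear activation."""
    total = sum(w * u for w, u in zip(weights, inputs))
    return activation(total)

# Hypothetical inputs and weights; the weighted sum here is -0.1.
out = semi_linear_unit([1.0, -2.0, 0.5], [0.4, 0.1, -0.6], logistic)
```

Passing a different function as `activation` changes the unit's nonlinearity without changing the weighted-sum step, which is one simple way to let units in different layers use different activation functions.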

The activation of each output unit of a feedforward ANN is a nonlinear function of the activation 
patterns over the network’s input units. The functions are parameterized by the network’s connection 
weights. An ANN with no hidden layers can represent only a very small fraction of the possible input-output 
functions. However, an ANN with a single hidden layer containing a large enough finite number 
of sigmoid units can approximate any continuous function on a compact region of the network’s input 
space to any degree of accuracy (Cybenko, 1989). This is also true for other nonlinear activation 
functions that satisfy mild conditions, but nonlinearity is essential: if all the units in a multi-layer 
feedforward ANN have linear activation functions, the entire network is equivalent to a network with 
no hidden layers (because linear functions of linear functions are themselves linear). 
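
The claim that purely linear layers collapse into a single layer can be checked numerically. In this sketch, which uses hypothetical weight matrices and omits bias terms, a two-layer linear network is compared with a one-layer network whose weight matrix is the product of the two.

```python
def matvec(m, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    """Matrix-matrix product for matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

# Hypothetical weights: 3 inputs -> 2 hidden units -> 2 outputs, all linear.
W1 = [[1.0, -2.0, 0.5],
      [0.0, 3.0, 1.0]]
W2 = [[2.0, 1.0],
      [-1.0, 4.0]]
x = [1.0, 2.0, -1.0]

two_layer = matvec(W2, matvec(W1, x))    # pass through the hidden layer
one_layer = matvec(matmul(W2, W1), x)    # single layer with weights W2 W1
```

Both computations give the same output for every input, so the hidden layer adds no representational power unless its units are nonlinear.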

Despite this “universal approximation” property of one-hidden-layer ANNs, both experience and 
theory show that approximating the complex functions needed for many artificial intelligence tasks is 
made easier by, and indeed may require, abstractions that are hierarchical compositions of many layers of 
lower-level abstractions, that is, abstractions produced by deep architectures such as ANNs with many 
hidden layers. (See Bengio, 2009, for a thorough review.) The successive layers of a deep ANN compute 
increasingly abstract representations of the network’s “raw” input, with each unit providing a feature 
contributing to a hierarchical representation of the overall input-output function of the network. 

Training the hidden layers of an ANN is therefore a way to automatically create features appropriate 
for a given problem so that hierarchical representations can be produced without relying exclusively on 
hand-crafted features. This has been an enduring challenge for artificial intelligence and explains why 
learning algorithms for ANNs with hidden layers have received so much attention over the years. ANNs 
typically learn by a stochastic gradient method (Section 9.3). Each weight is adjusted in a direction 
aimed at improving the network’s overall performance as measured by an objective function to be either 
minimized or maximized. In the most common supervised learning case, the objective function is the 
expected error, or loss, over a set of labeled training examples. In reinforcement learning, ANNs can 
use TD errors to learn value functions, or they can aim to maximize expected reward as in a gradient 
bandit (Section 2.8) or a policy-gradient algorithm (Chapter 13). In all of these cases it is necessary to 
estimate how a change in each connection weight would influence the network’s overall performance, in 
other words, to estimate the partial derivative of an objective function with respect to each weight, given 
the current values of all the network’s weights. The gradient is the vector of these partial derivatives. 

The most successful way to do this for ANNs with hidden layers (provided the units have differentiable 
activation functions) is the backpropagation algorithm, which consists of alternating forward and 
backward passes through the network. Each forward pass computes the activation of each unit given the 
current activations of the network’s input units. After each forward pass, a backward pass efficiently 
computes a partial derivative for each weight. (As in other stochastic gradient learning algorithms, 
the vector of these partial derivatives is an estimate of the true gradient.) In Section 15.10 we discuss 
methods for training ANNs with hidden layers that use reinforcement learning principles instead of 
backpropagation. These methods are less efficient than the backpropagation algorithm, but they may 
be closer to how real neural networks learn. 
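
To make the forward and backward passes concrete, here is a minimal sketch for a very small network. It assumes one hidden layer of logistic units, a single linear output unit, no bias terms, and a squared-error objective on one training example; all of these choices, along with the weights and names, are illustrative assumptions rather than a prescribed architecture.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, w2):
    """Forward pass: hidden activations h and the scalar output y."""
    h = [logistic(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = sum(w * hj for w, hj in zip(w2, h))
    return h, y

def loss(x, target, W1, w2):
    """Squared-error objective for one training example."""
    _, y = forward(x, W1, w2)
    return 0.5 * (y - target) ** 2

def backward(x, target, W1, w2):
    """Backward pass: partial derivative of the loss for every weight."""
    h, y = forward(x, W1, w2)
    delta_out = y - target                        # derivative of loss w.r.t. y
    grad_w2 = [delta_out * hj for hj in h]        # output-layer weights
    # Propagate the error through each logistic unit, whose slope is h (1 - h).
    delta_hidden = [delta_out * w * hj * (1.0 - hj) for w, hj in zip(w2, h)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hidden]
    return grad_W1, grad_w2

# Hypothetical example: 2 inputs, 3 hidden units, 1 output.
x, target = [0.5, -1.0], 0.3
W1 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
w2 = [0.7, -0.1, 0.2]
grad_W1, grad_w2 = backward(x, target, W1, w2)
```

A useful sanity check for any such implementation is to compare each computed partial derivative with a finite-difference estimate obtained by perturbing one weight at a time and re-evaluating `loss`.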

The backpropagation algorithm can produce good results for shallow networks having 1 or 2 hidden 
layers, but it may not work well for deeper ANNs. In fact, training a network with k + 1 hidden 
layers can actually result in poorer performance than training a network with k hidden layers, even 
though the deeper network can represent all the functions that the shallower network can (Bengio, 
2009). Explaining results like these is not easy, but several factors are important. First, the large 
number of weights in a typical deep ANN makes it difficult to avoid the problem of overfitting, that 
is, the problem of failing to generalize correctly to cases on which the network has not been trained. 
Second, backpropagation does not work well for deep ANNs because the partial derivatives computed 
by its backward passes either decay rapidly toward the input side of the network, making learning by 
deep layers extremely slow, or the partial derivatives grow rapidly toward the input side of the network, 
making learning unstable. Methods for dealing with these problems are largely responsible for many 
impressive recent results achieved by systems that use deep ANNs. 

Overfitting is a problem for any function approximation method that adjusts functions with many 
degrees of freedom on the basis of limited training data. It is less of a problem for on-line reinforcement 
learning that does not rely on limited training sets, but generalizing effectively is still an important 
issue. Overfitting is a problem for ANNs in general, but especially so for deep ANNs because they tend 
to have very large numbers of weights. Many methods have been developed for reducing overfitting. 
These include stopping training when performance begins to decrease on validation data different from 
the training data (cross validation), modifying the objective function to discourage complexity of the 
approximation (regularization), and introducing dependencies among the weights to reduce the number 
of degrees of freedom (e.g., weight sharing). 

A particularly effective method for reducing overfitting by deep ANNs is the dropout method introduced 
by Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov (2014). During training, 
units are randomly removed from the network (dropped out) along with their connections. This can be 
thought of as training a large number of “thinned” networks. Combining the results of these thinned 
networks at test time is a way to improve generalization performance. The dropout method efficiently 
approximates this combination by multiplying each outgoing weight of a unit by the probability that 
that unit was retained during training. Srivastava et al. found that this method significantly improves 
generalization performance. It encourages individual hidden units to learn features that work well with 
random collections of other features. This increases the versatility of the features formed by the hidden 
units so that the network does not overly specialize to rarely-occurring cases. 
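
The weight-scaling rule can be illustrated on a toy case. The sketch below assumes a single linear output unit fed by three hidden units with hypothetical activations and weights. For a linear output the scaled network reproduces the average over all thinned networks exactly; with nonlinear units downstream it is only an approximation.

```python
import itertools

def thinned_output(h, w_out, mask):
    """Output of one thinned network: dropped units (mask 0) contribute nothing."""
    return sum(m * w * hj for m, w, hj in zip(mask, w_out, h))

def scaled_output(h, w_out, p_retain):
    """Full network with each outgoing weight multiplied by the retention probability."""
    return sum(p_retain * w * hj for w, hj in zip(w_out, h))

def expected_thinned_output(h, w_out, p_retain):
    """Exact average over all 2^n thinned networks, weighted by their probability."""
    n = len(h)
    total = 0.0
    for mask in itertools.product([0, 1], repeat=n):
        kept = sum(mask)
        prob = (p_retain ** kept) * ((1.0 - p_retain) ** (n - kept))
        total += prob * thinned_output(h, w_out, mask)
    return total

h = [0.2, 0.9, 0.5]         # hypothetical hidden activations
w_out = [1.0, -2.0, 0.5]    # hypothetical outgoing weights to a linear output unit
p = 0.8                     # probability that a unit is retained during training
```

Enumerating all thinned networks costs time exponential in the number of units, whereas the scaled network needs a single pass, which is why the scaling rule is attractive.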

Hinton, Osindero, and Teh (2006) took a major step toward solving the problem of training the 
deep layers of a deep ANN in their work with deep belief networks, layered networks closely related 
to the deep ANNs discussed here. In their method, the deepest layers are trained one at a time using 
an unsupervised learning algorithm. Without relying on the overall objective function, unsupervised 
learning can extract features that capture statistical regularities of the input stream. The deepest layer 
is trained first, then with input provided by this trained layer, the next deepest layer is trained, and so 
on, until the weights in all, or many, of the network’s layers are set to values that now act as initial values 
for supervised learning. The network is then fine-tuned by backpropagation with respect to the overall 
objective function. Studies show that this approach generally works much better than backpropagation 
with weights initialized with random values. The better performance of networks trained with weights 
initialized this way could be due to many factors, but one idea is that this method places the network 
in a region of weight space from which a gradient-based algorithm can make good progress. 

Batch normalization (Ioffe and Szegedy, 2015) is another technique that makes it easier to train 
deep ANNs. It has long been known that ANN learning is easier if the network input is normalized, for 
example, by adjusting each input variable to have zero mean and unit variance. Batch normalization for 
training deep ANNs normalizes the output of deep layers before they feed into the following layer. Ioffe 
and Szegedy (2015) used statistics from subsets, or “mini-batches,” of training examples to normalize 
these between-layer signals to improve the learning rate of deep ANNs. 
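
The normalization step itself is simple to sketch. The function below rescales each unit's signal across a mini-batch to zero mean and approximately unit variance. The small constant `eps` is an assumption added to avoid division by zero, and the learned scale and shift parameters of the full method are omitted.

```python
import math

def batch_normalize(batch, eps=1e-5):
    """Normalize each unit's signal over a mini-batch.

    `batch` is a list of activation vectors, one per training example.
    Each coordinate is shifted to zero mean and scaled to (nearly) unit
    variance using statistics computed from this mini-batch only.
    """
    n = len(batch)
    dims = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dims)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(dims)]
            for row in batch]

# Hypothetical mini-batch of four examples with two between-layer signals each.
mini_batch = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0], [6.0, 40.0]]
normalized = batch_normalize(mini_batch)
```

After normalization the two signals are on a comparable scale even though their raw ranges differ by an order of magnitude.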

Another technique useful for training deep ANNs is deep residual learning (He, Zhang, Ren, and 
Sun, 2016). Sometimes it is easier to learn how a function differs from the identity function than to 
learn the function itself. Then adding this difference, or residual function, to the input produces the 
desired function. In deep ANNs, a block of layers can be made to learn a residual function simply by 
adding shortcut, or skip, connections around the block. These connections add the input to the block 
to its output, and no additional weights are needed. He et al. (2016) evaluated this method using deep 
convolutional networks with skip connections around every pair of adjacent layers, finding substantial 
improvement over networks without the skip connections on benchmark image classification tasks. Both 
batch normalization and deep residual learning were used in the reinforcement learning application to 
the game of Go that we describe in Chapter 16. 
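
A skip connection is easy to express in code. The following sketch assumes a block of two weight layers with a rectifier between them and equal input and output dimensions; the weights are hypothetical. With all block weights at zero the block computes the identity function, so learning only has to capture the difference from identity.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def residual_block(x, W1, W2):
    """Compute x + F(x), where F(x) = W2 relu(W1 x) is the residual function.

    The skip connection adds the block's input to its output and needs
    no weights of its own.
    """
    residual = matvec(W2, relu(matvec(W1, x)))
    return [xi + ri for xi, ri in zip(x, residual)]

x = [1.0, -2.0, 0.5]
zeros = [[0.0] * 3 for _ in range(3)]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
half = [[0.5 * a for a in row] for row in identity]

identity_out = residual_block(x, zeros, zeros)    # zero residual: output equals input
shifted_out = residual_block(x, identity, half)   # adds 0.5 * relu(x) to x
```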

A type of deep ANN that has proven to be very successful in applications, including impressive reinforcement 
learning applications (Chapter 16), is the deep convolutional network. This type of network 
is specialized for processing high-dimensional data arranged in spatial arrays, such as images. It was 
inspired by how early visual processing works in the brain (LeCun, Bottou, Bengio and Haffner, 1998). 
Because of its special architecture, a deep convolutional network can be trained by backpropagation 
without resorting to methods like those described above to train the deep layers. 

Figure 9.15 illustrates the architecture of a deep convolutional network. This instance, from LeCun 
et al. (1998), was designed to recognize hand-written characters. It consists of alternating convolutional 
and subsampling layers, followed by several fully connected final layers. Each convolutional layer produces 
a number of feature maps. A feature map is a pattern of activity over an array of units, where 
each unit performs the same operation on data in its receptive field, which is the part of the data it 
“sees” from the preceding layer (or from the external input in the case of the first convolutional layer). 
The units of a feature map are identical to one another except that their receptive fields, which are all 
the same size and shape, are shifted to different locations on the arrays of incoming data. Units in the 
same feature map share the same weights. This means that a feature map detects the same feature 
no matter where it is located in the input array. In the network in Figure 9.15, for example, the first 
convolutional layer produces 6 feature maps, each consisting of 28 x 28 units. Each unit in each feature 
map has a 5 x 5 receptive field, and these receptive fields overlap (in this case by four columns and 
four rows). Consequently, each of the 6 feature maps is specified by just 25 adjustable weights. 
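
The sizes quoted above follow from simple arithmetic, which the sketch below reproduces. It assumes a square input, receptive fields shifted by one position at a time, no padding, and no bias term; the image contents and the averaging weights are hypothetical.

```python
def conv_output_size(input_size, field_size, stride=1):
    """Number of receptive-field positions along one dimension."""
    return (input_size - field_size) // stride + 1

def feature_map(image, kernel):
    """One feature map: every unit applies the same shared weights to its own
    receptive field, and neighboring fields are shifted by one position."""
    k = len(kernel)
    out = conv_output_size(len(image), k)
    return [[sum(kernel[a][b] * image[r + a][c + b]
                 for a in range(k) for b in range(k))
             for c in range(out)]
            for r in range(out)]

# Hypothetical 32 x 32 input and a 5 x 5 set of shared weights (a plain average).
image = [[float((r * 32 + c) % 7) for c in range(32)] for r in range(32)]
kernel = [[1.0 / 25.0] * 5 for _ in range(5)]

fmap = feature_map(image, kernel)
num_weights = sum(len(row) for row in kernel)   # weights shared by the whole map
overlap = len(kernel) - 1                       # columns shared by adjacent fields
```

A 32 x 32 input with 5 x 5 receptive fields gives 32 - 5 + 1 = 28 positions in each dimension, adjacent fields share 5 - 1 = 4 columns (or rows), and the whole map is described by the 25 shared weights.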

The subsampling layers of a deep convolutional network reduce the spatial resolution of the feature 
maps. Each feature map in a subsampling layer consists of units that average over a receptive field 
of units in the feature maps of the preceding convolutional layer. For example, each unit in each of 
the 6 feature maps in the first subsampling layer of the network of Figure 9.15 averages over a 2 x 2 
non-overlapping receptive field over one of the feature maps produced by the first convolutional layer, 
resulting in six 14 x 14 feature maps. Subsampling layers reduce the network’s sensitivity to the spatial 
locations of the features detected, that is, they help make the network’s responses spatially invariant. 
This is useful because a feature detected at one place in an image is likely to be useful at other places 
as well. 
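
Subsampling by averaging can be sketched in the same style. The function below averages over non-overlapping 2 x 2 receptive fields, turning a 28 x 28 feature map into a 14 x 14 one; the feature map contents are hypothetical, and any trainable scaling applied after the average is left out.

```python
def subsample(feature_map, size=2):
    """Average over non-overlapping size x size receptive fields."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [[sum(feature_map[r * size + a][c * size + b]
                 for a in range(size) for b in range(size)) / (size * size)
             for c in range(cols)]
            for r in range(rows)]

# Hypothetical 28 x 28 feature map whose entry at (r, c) is r + c.
fmap = [[float(r + c) for c in range(28)] for r in range(28)]
pooled = subsample(fmap)
```

Because each output unit responds to the average of a small neighborhood, shifting a detected feature by one position changes the subsampled map only slightly.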

Advances in the design and training of ANNs, of which we have only mentioned a few, all contribute 
to reinforcement learning. Although current reinforcement learning theory is mostly limited to methods 
using tabular or linear function approximation methods, the impressive performances of notable 
reinforcement learning applications owe much of their success to nonlinear function approximation by 
multi-layer ANNs. We discuss several of these applications in Chapter 16. 

[Figure 9.15 diagram. Layer labels recoverable from the figure: INPUT 32x32; C1: feature maps 6@28x28; 
S2: f. maps; C3: f. maps 16@10x10; S4: f. maps 16@5x5; F6: layer 84; OUTPUT 10. Operation labels: 
Convolutions, Subsampling, Convolutions, Subsampling, Full connection, Gaussian connections.] 

Figure 9.15: Deep Convolutional Network. Republished with permission of Proceedings of the IEEE, from 
Gradient-based learning applied to document recognition, LeCun, Bottou, Bengio, and Haffner, volume 86, 
1998; permission conveyed through Copyright Clearance Center, Inc. 

9.7 Least-Squares TD 

All the methods we have discussed so far in this chapter have required computation per time step 
proportional to the number of parameters. With more computation, however, one can do better. In 
this section we present a method for linear function approximation that is arguably the best that can 
be done for this case. 

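As we established in Section 9.4, TD(0) with linear function approximation converges asymptotically 
(for appropriately decreasing step sizes) to the TD fixed point: 

$$\mathbf{w}_{\rm TD} = \mathbf{A}^{-1}\mathbf{b},$$

where 

$$\mathbf{A} = \mathbb{E}\left[\mathbf{x}_t(\mathbf{x}_t - \gamma\mathbf{x}_{t+1})^\top\right] \quad\text{and}\quad \mathbf{b} = \mathbb{E}\left[R_{t+1}\mathbf{x}_t\right].$$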

Why, one might ask, must we compute this solution iteratively? This is wasteful of data! Could one 
not do better by computing estimates of $\mathbf{A}$ and $\mathbf{b}$, and then directly computing the TD fixed point? 
The Least-Squares TD algorithm, commonly known as LSTD, does exactly this. It forms the natural 
estimates 

$$\widehat{\mathbf{A}}_t = \sum_{k=0}^{t-1} \mathbf{x}_k(\mathbf{x}_k - \gamma\mathbf{x}_{k+1})^\top + \varepsilon\mathbf{I} \quad\text{and}\quad \widehat{\mathbf{b}}_t = \sum_{k=0}^{t-1} R_{k+1}\mathbf{x}_k, \tag{9.19}$$

where $\mathbf{I}$ is the identity matrix, and $\varepsilon\mathbf{I}$, for some small $\varepsilon > 0$, ensures that $\widehat{\mathbf{A}}_t$ is always invertible. It 
might seem that these estimates should both be divided by $t$, the number of terms in each sum, and indeed 
they should; as defined here, these are really estimates of $t$ times $\mathbf{A}$ and $t$ times $\mathbf{b}$. However, the factor 
of $t$ will not matter, as when we use these estimates we will be effectively dividing one by the other. LSTD 
estimates the TD fixed point as 

$$\mathbf{w}_t = \widehat{\mathbf{A}}_t^{-1}\widehat{\mathbf{b}}_t. \tag{9.20}$$

This algorithm is the most data efficient form of linear TD(0), but it is also more expensive computationally. 
Recall that semi-gradient TD(0) requires memory and per-step computation that is only $O(d)$. 

How complex is LSTD? As it is written above the complexity seems to increase with $t$, but the two 
approximations in (9.19) could be implemented incrementally using the techniques we have covered 
earlier (e.g., in Chapter 2) so that they can be done in constant time per step. Even so, the update for 
$\widehat{\mathbf{A}}_t$ would involve an outer product (a column vector times a row vector) and thus would be a matrix 
update; its computational complexity would be $O(d^2)$, and of course the memory required to hold the 
$\widehat{\mathbf{A}}_t$ matrix would be $O(d^2)$. 

A potentially greater problem is that our final computation (9.20) uses the inverse of $\widehat{\mathbf{A}}_t$, and the 
computational complexity of a general inverse computation is $O(d^3)$. Fortunately, an inverse of a 
matrix of our special form, a sum of outer products, can also be updated incrementally with only 
$O(d^2)$ computations, as 

$$\begin{aligned}
\widehat{\mathbf{A}}_t^{-1} &= \left(\widehat{\mathbf{A}}_{t-1} + \mathbf{x}_{t-1}(\mathbf{x}_{t-1} - \gamma\mathbf{x}_t)^\top\right)^{-1} && \text{(from (9.19))} \\
&= \widehat{\mathbf{A}}_{t-1}^{-1} - \frac{\widehat{\mathbf{A}}_{t-1}^{-1}\mathbf{x}_{t-1}(\mathbf{x}_{t-1} - \gamma\mathbf{x}_t)^\top\widehat{\mathbf{A}}_{t-1}^{-1}}{1 + (\mathbf{x}_{t-1} - \gamma\mathbf{x}_t)^\top\widehat{\mathbf{A}}_{t-1}^{-1}\mathbf{x}_{t-1}},
\end{aligned} \tag{9.21}$$

for $t > 0$, with $\widehat{\mathbf{A}}_0 = \varepsilon\mathbf{I}$. Although the identity (9.21), known as the Sherman-Morrison formula, is 
superficially complicated, it involves only vector-matrix and vector-vector multiplications and thus is 
only $O(d^2)$. Thus we can store the inverse matrix $\widehat{\mathbf{A}}_t^{-1}$, maintain it with (9.21), and then use it in (9.20), 
all with only $O(d^2)$ memory and per-step computation. The complete algorithm is given in the box 
below. 


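LSTD for estimating $\hat{v} \approx v_\pi$ ($O(d^2)$ version) 

Input: feature representation $\mathbf{x}(s) \in \mathbb{R}^d$ for all $s \in \mathcal{S}$, $\mathbf{x}(\text{terminal}) = \mathbf{0}$ 

$\widehat{\mathbf{A}^{-1}} \leftarrow \varepsilon^{-1}\mathbf{I}$    (a $d \times d$ matrix) 
$\widehat{\mathbf{b}} \leftarrow \mathbf{0}$    (a $d$-dimensional vector) 

Repeat (for each episode): 
    Initialize $S$; obtain corresponding $\mathbf{x}$ 
    Repeat (for each step of episode): 
        Choose $A \sim \pi(\cdot|S)$ 
        Take action $A$, observe $R$, $S'$; obtain corresponding $\mathbf{x}'$ 
        $\mathbf{v} \leftarrow \widehat{\mathbf{A}^{-1}}^\top(\mathbf{x} - \gamma\mathbf{x}')$ 
        $\widehat{\mathbf{A}^{-1}} \leftarrow \widehat{\mathbf{A}^{-1}} - (\widehat{\mathbf{A}^{-1}}\mathbf{x})\mathbf{v}^\top / (1 + \mathbf{v}^\top\mathbf{x})$ 
        $\widehat{\mathbf{b}} \leftarrow \widehat{\mathbf{b}} + R\mathbf{x}$ 
        $\mathbf{w} \leftarrow \widehat{\mathbf{A}^{-1}}\widehat{\mathbf{b}}$ 
        $S \leftarrow S'$; $\mathbf{x} \leftarrow \mathbf{x}'$ 
    until $S'$ is terminal 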
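
The boxed algorithm can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: it assumes the experience has already been collected as lists of transitions, so the policy and environment are folded into the data, and the function names and the three-state chain used to exercise it are hypothetical.

```python
def lstd(episodes, d, gamma, epsilon):
    """LSTD with Sherman-Morrison updates of the inverse matrix, O(d^2) per step.

    `episodes` is a list of episodes, each a list of (x, reward, x_next)
    tuples of feature vectors, with x_next all zeros at termination.
    """
    inv_a = [[(1.0 / epsilon) if i == j else 0.0 for j in range(d)]
             for i in range(d)]                  # inverse of epsilon * I
    b = [0.0] * d
    w = [0.0] * d
    for episode in episodes:
        for x, reward, x_next in episode:
            diff = [xi - gamma * xn for xi, xn in zip(x, x_next)]
            # v = (A^-1)^T (x - gamma x'), so that v^T = (x - gamma x')^T A^-1.
            v = [sum(inv_a[i][j] * diff[i] for i in range(d)) for j in range(d)]
            ax = [sum(inv_a[i][j] * x[j] for j in range(d)) for i in range(d)]
            denom = 1.0 + sum(vj * xj for vj, xj in zip(v, x))
            for i in range(d):
                for j in range(d):
                    inv_a[i][j] -= ax[i] * v[j] / denom
            b = [bi + reward * xi for bi, xi in zip(b, x)]
            w = [sum(inv_a[i][j] * b[j] for j in range(d)) for i in range(d)]
    return w

def one_hot(i, d):
    return [1.0 if j == i else 0.0 for j in range(d)]

# Hypothetical chain 0 -> 1 -> 2 -> terminal with rewards 1, 2, 3 and one-hot features.
d, gamma = 3, 0.9
episode = [(one_hot(0, d), 1.0, one_hot(1, d)),
           (one_hot(1, d), 2.0, one_hot(2, d)),
           (one_hot(2, d), 3.0, [0.0] * d)]
w = lstd([episode] * 50, d, gamma, epsilon=1e-3)
```

With one-hot features the TD fixed point coincides with the true values, here 5.23, 4.7, and 3.0 for the three states, and the learned weights approach them up to the small bias introduced by the regularizing term.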


Of course, $O(d^2)$ is still significantly more expensive than the $O(d)$ of semi-gradient TD. Whether 
the greater data efficiency of LSTD is worth this computational expense depends on how large $d$ is, 
how important it is to learn quickly, and the expense of other parts of the system. The fact that 
LSTD requires no step-size parameter is sometimes also touted, but the advantage of this is probably 
overstated. LSTD does not require a step size, but it does require $\varepsilon$; if $\varepsilon$ is chosen too small the 
sequence of inverses can vary wildly, and if $\varepsilon$ is chosen too large then learning is slowed. In addition, 
LSTD’s lack of a step-size parameter means that it never forgets. This is sometimes desirable, but it 
is problematic if the target policy $\pi$ changes as it does in reinforcement learning and GPI. In control 
applications, LSTD typically has to be combined with some other mechanism to induce forgetting, 
mooting any initial advantage of not requiring a step-size parameter. 


9.8 Memory-based Function Approximation 

So far we have discussed the parametric approach to approximating value functions. In this approach, 
a learning algorithm adjusts the parameters of a functional form intended to approximate the value 
function over a problem’s entire state space. Each update, $s \mapsto g$, is a training example used by the 
learning algorithm to change the parameters with the aim of reducing the approximation error. After 
the update, the training example can be discarded (although it might be saved to be used again). When 
an approximate value of a state (which we will call the query state) is needed, the function is simply 
evaluated at that state using the latest parameters produced by the learning algorithm. 

Memory-based function approximation methods are very different. They simply save training examples 
in memory as they arrive (or at least save a subset of the examples) without updating any 
parameters. Then, whenever a query state’s value estimate is needed, a set of examples is retrieved 
from memory and used to compute a value estimate for the query state. This approach is sometimes 
called lazy learning because processing training examples is postponed until the system is queried to 
provide an output. 

Memory-based function approximation methods are prime examples of nonparametric methods. Unlike 
parametric methods, the approximating function’s form is not limited to a fixed parameterized 
class of functions, such as linear functions or polynomials, but is instead determined by the training 
examples themselves, together with some means for combining them to output estimated values for 
query states. As more training examples accumulate in memory, one expects nonparametric methods 
to produce increasingly accurate approximations of any target function. 

There are many different memory-based methods depending on how the stored training examples 
are selected and how they are used to respond to a query. Here, we focus on local-learning methods 
that approximate a value function only locally in the neighborhood of the current query state. These 
methods retrieve a set of training examples from memory whose states are judged to be the most 
relevant to the query state, where relevance usually depends on the distance between states: the closer 
a training example’s state is to the query state, the more relevant it is considered to be, where distance 
can be defined in many different ways. After the query state is given a value, the local approximation 
is discarded. 

The simplest example of the memory-based approach is the nearest neighbor method, which simply 
finds the example in memory whose state is closest to the query state and returns that example’s value 
as the approximate value of the query state. In other words, if the query state is $s$, and $s' \mapsto g$ is the 
example in memory in which s' is the closest state to s, then g is returned as the approximate value 
of s. Slightly more complicated are weighted average methods that retrieve a set of nearest neighbor 
examples and return a weighted average of their target values, where the weights generally decrease 
with increasing distance between their states and the query state. Locally weighted regression is similar, 
but it fits a surface to the values of a set of nearest states by means of a parametric approximation 
method that minimizes a weighted error measure like (9.1), where the weights depend on distances from 
the query state. The value returned is the evaluation of the locally-fitted surface at the query state, 
after which the local approximation surface is discarded. 
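
The nearest neighbor and weighted average methods can be sketched directly from their descriptions. In the example below, the stored examples, the Euclidean distance, the Gaussian form of the weights, and the `bandwidth` parameter are all illustrative assumptions; the text requires only that weights decrease with distance.

```python
import math

def distance(a, b):
    """Euclidean distance between two states given as tuples."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def nearest_neighbor_value(query, memory):
    """Return the target g of the stored example whose state is closest to the query."""
    _, g = min(memory, key=lambda example: distance(query, example[0]))
    return g

def weighted_average_value(query, memory, k, bandwidth):
    """Weighted average of the targets of the k nearest stored examples.

    Weights decrease with distance from the query state; nothing is kept
    after the value is returned.
    """
    nearest = sorted(memory, key=lambda example: distance(query, example[0]))[:k]
    weights = [math.exp(-distance(query, s) ** 2 / (2.0 * bandwidth ** 2))
               for s, _ in nearest]
    return sum(w * g for w, (_, g) in zip(weights, nearest)) / sum(weights)

# Hypothetical memory of (state, target) pairs.
memory = [((0.0, 0.0), 1.0), ((1.0, 0.0), 2.0), ((0.0, 1.0), 4.0), ((5.0, 5.0), 10.0)]

nn_value = nearest_neighbor_value((0.9, 0.1), memory)
avg_value = weighted_average_value((0.5, 0.0), memory, k=2, bandwidth=1.0)
```

Here the nearest neighbor of (0.9, 0.1) is the example stored at (1, 0), and the query (0.5, 0) lies midway between two stored states, so its weighted average is the plain mean of their targets.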

Being nonparametric, memory-based methods have the advantage over parametric methods of not 
limiting approximations to pre-specified functional forms. This allows accuracy to improve as more data 
accumulates. Memory-based local approximation methods have other properties that make them well 
suited for reinforcement learning. Because trajectory sampling is of such importance in reinforcement 
learning, as discussed in Section 8.6, memory-based local methods can focus function approximation 
on local neighborhoods of states (or state-action pairs) visited in real or simulated trajectories. There 
may be no need for global approximations because many areas of the state space will never (or almost 
never) be reached. In addition, memory-based methods allow an agent’s experience to have a relatively 
immediate effect on value estimates in the neighborhood of its environment’s current state, in contrast 
with a parametric method’s need to incrementally adjust parameters of a global approximation. 

Avoiding global approximations is also a way to address the curse of dimensionality. For example, 
for a state space with k dimensions, a tabular method storing a global approximation requires memory 
exponential in k. On the other hand, in storing examples for a memory-based method, each example 
requires memory proportional to k, and the memory required to store, say, n examples is linear in n. 
Nothing is exponential in k or n. Of course, the critical remaining issue is whether a memory-based 
method can answer queries quickly enough to be useful to an agent. A related concern is how speed 
degrades as the size of the memory grows. Finding nearest neighbors in a large database can take too 
long to be practical in many applications. 

Proponents of memory-based methods have developed ways to accelerate the nearest neighbor search. 
Using parallel computers or special purpose hardware is one approach; another is the use of special 
multidimensional data structures to store the training data. One data structure studied for this application 
is the k-d tree (short for k-dimensional tree), which recursively splits a k-dimensional space into regions 
arranged as nodes of a binary tree. Depending on the amount of data and how it is distributed over 
the state space, nearest-neighbor search using k-d trees can quickly eliminate large regions of the space 
in the search for neighbors, making the searches feasible in some problems where naive searches would 
take too long. 
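
To show how a k-d tree can eliminate regions during a search, here is a compact sketch. It assumes points are tuples of equal length, splits at the median along one coordinate per level, and uses squared Euclidean distance; the dictionary-based node layout and the synthetic points are hypothetical choices made for brevity.

```python
def build_kd_tree(points, depth=0):
    """Recursively split the points at the median along one coordinate per level."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kd_tree(points[:mid], depth + 1),
            "right": build_kd_tree(points[mid + 1:], depth + 1)}

def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest(node, query, best=None):
    """Return the stored point closest to `query`, skipping any region that
    cannot contain a point closer than the best one found so far."""
    if node is None:
        return best
    point = node["point"]
    if best is None or squared_distance(query, point) < squared_distance(query, best):
        best = point
    gap = query[node["axis"]] - point[node["axis"]]
    near, far = (node["left"], node["right"]) if gap < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Search the far side only if the splitting plane is closer than the best point.
    if gap ** 2 < squared_distance(query, best):
        best = nearest(far, query, best)
    return best

# Hypothetical stored states: 200 deterministic, well-spread points in the unit square.
points = [((i * 37) % 101 / 101.0, (i * 53) % 103 / 103.0) for i in range(200)]
tree = build_kd_tree(points)
closest = nearest(tree, (0.3, 0.7))
```

The answer is the same as that of an exhaustive scan, but whole subtrees are skipped whenever the splitting plane lies farther from the query than the best point found so far.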

Locally weighted regression additionally requires fast ways to do the local regression computations 
which have to be repeated to answer each query. Researchers have developed many ways to address 
these problems, including methods for forgetting entries in order to keep the size of the database within 
bounds. The Bibliographic and Historical Comments section at the end of this chapter points to some of 
the relevant literature, including a selection of papers describing applications of memory-based learning 
to reinforcement learning. 


9.9 Kernel-based Function Approximation 

Memory-based methods such as the weighted average and locally weighted regression methods described 
above depend on assigning weights to examples $s' \mapsto g$ in the database depending on the distance 
between $s'$ and a query state $s$. The function that assigns these weights is called a kernel function, 
or simply a kernel. In the weighted average and locally weighted regression methods, for example, a 
kernel function $k : \mathbb{R} \to \mathbb{R}$ assigns weights to distances between states. More generally, weights do not 
have to depend on distances; they can depend on some other measure of similarity between states. In 
this case, $k : \mathcal{S} \times \mathcal{S} \to \mathbb{R}$, so that $k(s, s')$ is the weight given to data about $s'$ in its influence on answering 
queries about $s$. 

Viewed slightly differently, k(s, s') is a measure of the strength of generalization from s' to s. Kernel 
functions numerically express how relevant knowledge about any state is to any other state. As an 
example, the strengths of generalization for tile coding shown in Figure 9.11 correspond to different 
kernel functions resulting from uniform and asymmetrical tile offsets. Although tile coding does not 
explicitly use a kernel function in its operation, it generalizes according to one. In fact, as we discuss 
more below, the strength of generalization resulting from linear parametric function approximation can 
always be described by a kernel function. 

Kernel regression is the memory-based method that computes a kernel weighted average of the 
targets of all examples stored in memory, assigning the result to the query state. If $\mathcal{D}$ is the set of 
stored examples, and $g(s')$ denotes the target for state $s'$ in a stored example, then kernel regression 
approximates the target function, in this case a value function depending on $\mathcal{D}$, as 

$$\hat{v}(s, \mathcal{D}) = \sum_{s' \in \mathcal{D}} k(s, s')\,g(s'). \tag{9.22}$$

The weighted average method described above is a special case in which $k(s, s')$ is non-zero only when 
$s$ and $s'$ are close to one another so that the sum need not be computed over all of $\mathcal{D}$. 
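
Kernel regression as in (9.22) amounts to a single weighted sum over the stored examples. The sketch below uses a Gaussian kernel on Euclidean distance; the kernel width, the stored examples, and the optional normalization (which rescales the weights to sum to one so that the result is a true weighted average) are assumptions added for the example.

```python
import math

def gaussian_kernel(s, s_prime, width=1.0):
    """Gaussian RBF kernel on the Euclidean distance between two states."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(s, s_prime))
    return math.exp(-sq_dist / (2.0 * width ** 2))

def kernel_regression(s, examples, kernel, normalize=False):
    """Kernel-weighted sum of the stored targets g(s') over all examples.

    With normalize=True the kernel weights are rescaled to sum to one.
    """
    weights = [kernel(s, s_prime) for s_prime, _ in examples]
    total = sum(w * g for w, (_, g) in zip(weights, examples))
    return total / sum(weights) if normalize else total

# Hypothetical stored pairs (s', g) in a one-dimensional state space.
examples = [((0.0,), 1.0), ((1.0,), 3.0)]

raw = kernel_regression((0.5,), examples, gaussian_kernel)
estimate = kernel_regression((0.5,), examples, gaussian_kernel, normalize=True)
```

The query state 0.5 is equidistant from both stored states, so the normalized estimate is the mean of the two targets, while the unnormalized sum scales with the kernel values themselves.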

A common kernel is the Gaussian radial basis function (RBF) used in RBF function approximation as 
described in Section 9.5.5. In the method described there, RBFs are features whose centers and widths 
are either fixed from the start, with centers presumably concentrated in areas where many examples are 
expected to fall, or are adjusted in some way during learning. Barring methods that adjust centers and 
widths, this is a linear parametric method whose parameters are the weights of each RBF, which are 
typically learned by stochastic gradient, or semi-gradient, descent. The form of the approximation is a 
linear combination of the pre-determined RBFs. Kernel regression with an RBF kernel differs from this 
in two ways. First, it is memory-based: the RBFs are centered on the states of the stored examples. 
Second, it is nonparametric: there are no parameters to learn; the response to a query is given by (9.22). 

Of course, many issues have to be addressed for practical implementation of kernel regression, issues 
that are beyond the scope of our brief discussion. However, it turns out that any linear parametric 
regression method like those we described in Section 9.4, with states represented by feature vectors 
$\mathbf{x}(s) = (x_1(s), x_2(s), \ldots, x_d(s))^\top$, can be recast as kernel regression where $k(s, s')$ is the inner product 
of the feature vector representations of $s$ and $s'$; that is 

$$k(s, s') = \mathbf{x}(s)^\top\mathbf{x}(s'). \tag{9.23}$$

Kernel regression with this kernel function produces the same approximation that a linear parametric 
method would if it used these feature vectors and learned with the same training data. 

We skip the mathematical justification for this, which can be found in any modern machine learning 
text, such as Bishop (2006), and simply point out an important implication. Instead of constructing 
features for linear parametric function approximators, one can instead construct kernel functions directly
without referring at all to feature vectors. Not all kernel functions can be expressed as inner
products of feature vectors as in (9.23), but a kernel function that can be expressed like this can offer 
significant advantages over the equivalent parametric method. For many sets of feature vectors, (9.23) 
has a compact functional form that can be evaluated without any computation taking place in the 
d-dimensional feature space. In these cases, kernel regression is much less complex than directly using a 
linear parametric method with states represented by these feature vectors. This is the so-called “kernel 
trick” that allows one to work effectively in a high-dimensional, expansive feature space while actually
working only with the set of stored training examples. The kernel trick is the basis of many machine 
learning methods, and researchers have shown how it can sometimes benefit reinforcement learning. 
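
A small numerical check may make the kernel trick concrete. The sketch below uses the degree-2 polynomial kernel $(1 + s^\top s')^2$, chosen here purely as an illustration (it is not a kernel discussed in the text), and compares it with the inner product (9.23) of an explicit feature vector. The function names are hypothetical.

```python
import numpy as np

def explicit_features(s):
    """Explicit feature vector x(s) whose inner product equals (1 + s.s')^2.

    For an m-dimensional state the vector has (m + 1)(m + 2) / 2 components:
    a constant, the scaled coordinates, their squares, and scaled cross terms.
    """
    s = np.atleast_1d(np.asarray(s, dtype=float))
    m = len(s)
    cross = [np.sqrt(2.0) * s[i] * s[j] for i in range(m) for j in range(i + 1, m)]
    return np.concatenate(([1.0], np.sqrt(2.0) * s, s ** 2, cross))

def poly_kernel(s, s_prime):
    """The same quantity computed directly in the original state space."""
    s = np.atleast_1d(np.asarray(s, dtype=float))
    s_prime = np.atleast_1d(np.asarray(s_prime, dtype=float))
    return float((1.0 + s @ s_prime) ** 2)

s, s_prime = [0.5, -1.0, 2.0], [1.5, 0.25, -0.5]
via_features = float(explicit_features(s) @ explicit_features(s_prime))
via_kernel = poly_kernel(s, s_prime)
```

Both routes give the same number, but the kernel needs only one inner product of length m, whereas the explicit feature vector grows quadratically with m. For higher polynomial degrees the gap widens further.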


9.10 Looking Deeper at On-policy Learning: Interest and Emphasis

The algorithms we have considered so far in this chapter have treated all the states encountered equally, 
as if they were all equally important. In some cases, however, we are more interested in some states 
than others. In discounted episodic problems, for example, we may be more interested in accurately 
valuing early states in the episode than in later states where discounting may have made the rewards 
much less important to the value of the start state. Or, if an action-value function is being learned, 
it may be less important to accurately value poor actions whose value is much less than the greedy 
action. Function approximation resources are always limited, and if they were used in a more targeted 
way, then performance could be improved. 

One reason we have treated all states encountered equally is that then we are updating according to 
the on-policy distribution, for which stronger theoretical results are available for semi-gradient methods. 
Recall that the on-policy distribution was defined as the distribution of states encountered in an MDP 
while following the target policy. Now we will generalize this concept significantly. Rather than having 
one on-policy distribution for the MDP, we will have many. All of them will have in common that they 
are a distribution of states encountered in trajectories while following the target policy, but they will 
vary in how the trajectories are, in a sense, initiated. 

We now introduce some new concepts. First we introduce a non-negative scalar measure, a random
variable $I_t$ called interest, indicating the degree to which we are interested in accurately valuing the
state (or state-action pair) at time $t$. If we don’t care at all about the state, then the interest should
be zero; if we fully care, it might be one, though it is formally allowed to take any non-negative value.
The interest can be set in any causal way; for example, it may depend on the trajectory up to time $t$ or
the learned parameters at time $t$. The distribution $\mu$ in the VE (9.1) is then defined as the distribution
of states encountered while following the target policy, weighted by the interest. Second, we introduce
another non-negative scalar random variable, the emphasis $M_t$. This scalar multiplies the learning
update and thus emphasizes or de-emphasizes the learning done at time $t$. The general n-step learning
rule, replacing (9.15), is

$$\mathbf{w}_{t+n} = \mathbf{w}_{t+n-1} + \alpha M_t \left[ G_{t:t+n} - \hat{v}(S_t, \mathbf{w}_{t+n-1}) \right] \nabla \hat{v}(S_t, \mathbf{w}_{t+n-1}), \qquad 0 \le t < T, \tag{9.24}$$

with the n-step return given by (9.16) and the emphasis determined recursively from the interest by

$$M_t = I_t + \gamma^n M_{t-n}, \qquad 0 \le t < T, \tag{9.25}$$

with $M_t = 0$ for all $t < 0$. These equations are taken to include the Monte Carlo case, for which
$G_{t:t+n} = G_t$, all the updates are made at the end of the episode, $n = T - t$, and $M_t = I_t$.
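
The emphasis recursion (9.25) is short enough to write out directly. The following Python sketch, with a hypothetical function name and a plain list of per-step interests as input, computes $M_t$ for one episode under the convention $M_t = 0$ for $t < 0$.

```python
def emphasis(interest, gamma, n):
    """Return [M_0, ..., M_{T-1}] with M_t = I_t + gamma**n * M_{t-n}.

    `interest` holds I_0, ..., I_{T-1}; M_t is taken to be 0 for t < 0.
    """
    M = []
    for t, I_t in enumerate(interest):
        M_earlier = M[t - n] if t - n >= 0 else 0.0
        M.append(I_t + (gamma ** n) * M_earlier)
    return M

# Interest only in the first state of a four-step episode, no discounting.
two_step = emphasis([1, 0, 0, 0], gamma=1.0, n=2)
```

With n = 2 the emphasis is nonzero at times 0 and 2: the state at time 2 is emphasized because the update at time 0 bootstraps from it. Setting n to the episode length or more leaves $M_t = I_t$, which corresponds to the Monte Carlo case.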

Example 9.4 illustrates how interest and emphasis can result in more accurate value estimates. 


Example 9.4: Interest and Emphasis 


To see the potential benefits of using interest and emphasis, consider the four-state Markov 
reward process shown below: 


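[Figure: a chain of four states, left to right. Interest (above each state): i = 1, 0, 0, 0. Estimated value (inside each state): w_1, w_1, w_2, w_2. True value (below each state): v_π = 4, 3, 2, 1.]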


Episodes start in the leftmost state, then transition one state to the right, with a reward of +1, 
on each step until the terminal state is reached. The true value of the first state is thus 4, of the 
second state 3, and so on as shown below each state. These are the true values; the estimated 
values can only approximate these because they are constrained by the parameterization. There 
are two components to the parameter vector $\mathbf{w} = (w_1, w_2)^\top$, and the parameterization is as
written inside each state. The estimated values of the first two states are given by $w_1$ alone
and thus must be the same even though their true values are different. Similarly, the estimated
values of the third and fourth states are given by $w_2$ alone and must be the same even though
their true values are different. Suppose that we are interested in accurately valuing only the 
leftmost state; we assign it an interest of 1 while all the other states are assigned an interest of 
0, as indicated above the states. 

First consider applying gradient Monte Carlo algorithms to this problem. The algorithms 
presented earlier in this chapter that do not take into account interest and emphasis (in (9.7) 
and the box on page 165) will converge (for decreasing step sizes) to the parameter vector 
$\mathbf{w}_\infty = (3.5, 1.5)$, which gives the first state (the only one we are interested in) a value of 3.5
(i.e., intermediate between the true values of the first and second states). The methods presented
in this section that do use interest and emphasis, on the other hand, will learn the value of the
first state exactly correctly; $w_1$ will converge to 4 while $w_2$ will never be updated because the
emphasis is zero in all states save the leftmost. 

Now consider applying two-step semi-gradient TD methods. The methods from earlier in 
this chapter without interest and emphasis (in (9.15) and (9.16) and the box on page 171) will 
again converge to $\mathbf{w}_\infty = (3.5, 1.5)$, while the methods with interest and emphasis converge to
$\mathbf{w}_\infty = (4, 2)$. The latter produces the exactly correct values for the first state and for the third
state (which the first state bootstraps from) while never making any updates corresponding to 
the second or fourth states. 
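
The numbers in Example 9.4 can be reproduced with a short simulation. The Python sketch below is an illustration under stated assumptions rather than the authors' code: it uses a small constant step size instead of decreasing step sizes (so the runs without emphasis settle very close to, but not exactly at, 3.5 and 1.5), takes $\gamma = 1$, and starts from $\mathbf{w} = \mathbf{0}$. The names `GROUP`, `INTEREST`, and `run` are hypothetical.

```python
import numpy as np

GROUP = [0, 0, 1, 1]             # state i is valued by w[GROUP[i]]
INTEREST = [1.0, 0.0, 0.0, 0.0]  # interest only in the leftmost state
T = 4                            # four steps per episode, each with reward +1

def run(n, use_emphasis, episodes=5000, alpha=0.01, gamma=1.0):
    """n-step semi-gradient TD on the four-state chain (n >= T gives Monte Carlo)."""
    w = np.zeros(2)
    for _ in range(episodes):
        M = [0.0] * T
        for t in range(T):
            if use_emphasis:
                # Emphasis recursion (9.25), with M_t = 0 for t < 0.
                M[t] = INTEREST[t] + gamma ** n * (M[t - n] if t - n >= 0 else 0.0)
            else:
                M[t] = 1.0       # every visited state weighted equally
            end = min(t + n, T)
            # n-step return: discounted rewards of +1, then bootstrap if not terminal.
            G = sum(gamma ** (k - t) for k in range(t, end))
            if end < T:
                G += gamma ** (end - t) * w[GROUP[end]]
            # The gradient of the estimate is one-hot on component GROUP[t].
            w[GROUP[t]] += alpha * M[t] * (G - w[GROUP[t]])
    return w

mc_plain, mc_emphatic = run(T, False), run(T, True)
td2_plain, td2_emphatic = run(2, False), run(2, True)
```

Under these assumptions the two runs without emphasis end near (3.5, 1.5), the emphatic Monte Carlo run ends at (4, 0) with $w_2$ left at its initial value, and the emphatic two-step TD run ends at (4, 2).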

9.11 Summary

Reinforcement learning systems must be capable of generalization if they are to be applicable to artificial 
intelligence or to large engineering applications. To achieve this, any of a broad range of existing methods 
for supervised-learning function approximation can be used simply by treating each update as a training 
example. 

Perhaps the most suitable supervised learning methods are those using parameterized function approximation,
in which the policy is parameterized by a weight vector $\mathbf{w}$. Although the weight vector has
many components, the state space is much larger still and we must settle for an approximate solution.
We defined VE($\mathbf{w}$) as a measure of the error in the values $v_{\pi_{\mathbf{w}}}(s)$ for a weight vector $\mathbf{w}$ under the
on-policy distribution, $\mu$. The VE gives us a clear way to rank different value-function approximations
in the on-policy case. 

To find a good weight vector, the most popular methods are variations of stochastic gradient descent 
(SGD). In this chapter we have focused on the on-policy case with a fixed policy, also known as policy
evaluation or prediction; a natural learning algorithm for this case is n-step semi-gradient TD, which
includes gradient Monte Carlo and semi-gradient TD(0) algorithms as the special cases when $n = \infty$ and
$n = 1$ respectively. Semi-gradient TD methods are not true gradient methods. In such bootstrapping
methods (including DP), the weight vector appears in the update target, yet this is not taken into
account in computing the gradient; thus they are semi-gradient methods. As such, they cannot rely
on classical SGD results. 

Nevertheless, good results can be obtained for semi-gradient methods in the special case of linear 
function approximation, in which the value estimates are sums of features times corresponding weights. 
The linear case is the most well understood theoretically and works well in practice when provided 
with appropriate features. Choosing the features is one of the most important ways of adding prior 
domain knowledge to reinforcement learning systems. They can be chosen as polynomials, but this case 
generalizes poorly in the online learning setting typically considered in reinforcement learning. Better 
is to choose features according to the Fourier basis, or according to some form of coarse coding with sparse
overlapping receptive fields. Tile coding is a form of coarse coding that is particularly computationally 
efficient and flexible. Radial basis functions are useful for one- or two-dimensional tasks in which a 
smoothly varying response is important. LSTD is the most data-efficient linear TD prediction method, 
but requires computation proportional to the square of the number of weights, whereas all the other 
methods are of complexity linear in the number of weights. Nonlinear methods include artificial neural 
networks trained by backpropagation and variations of SGD; these methods have become very popular 
in recent years under the name deep reinforcement learning. 

Linear semi-gradient n-step TD is guaranteed to converge under standard conditions, for all n, to 
a VE that is within a bound of the optimal error. This bound is always tighter for higher n and 
approaches zero as $n \to \infty$. However, in practice very large n results in very slow learning, and some
degree of bootstrapping ($1 \le n < \infty$) is usually preferable.


Bibliographical and Historical Remarks 

Generalization and function approximation have always been an integral part of reinforcement learning. 
Bertsekas and Tsitsiklis (1996), Bertsekas (2012), and Sugiyama et al. (2013) present the state of 
the art in function approximation in reinforcement learning. Some of the early work with function 
approximation in reinforcement learning is discussed at the end of this section. 

9.3 Gradient-descent methods for minimizing mean-squared error in supervised learning are well 
known. Widrow and Hoff (1960) introduced the least-mean-square (LMS) algorithm, which is 
the prototypical incremental gradient-descent algorithm. Details of this and related algorithms 
are provided in many texts (e.g., Widrow and Stearns, 1985; Bishop, 1995; Duda and Hart,
1973). 

Semi-gradient TD(0) was first explored by Sutton (1984, 1988), as part of the linear TD($\lambda$)
algorithm that we will treat in Chapter 12. The term “semi-gradient” to describe these bootstrapping
methods is new to the second edition of this book.

The earliest use of state aggregation in reinforcement learning may have been Michie and
Chambers’s BOXES system (1968). The theory of state aggregation in reinforcement learning
has been developed by Singh, Jaakkola, and Jordan (1995) and Tsitsiklis and Van Roy (1996). 
State aggregation has been used in dynamic programming from its earliest days (e.g., Bellman, 
1957a). 

9.4 Sutton (1988) proved convergence of linear TD(0) in the mean to the minimal VE solution for 
the case in which the feature vectors, $\{\mathbf{x}(s) : s \in \mathcal{S}\}$, are linearly independent. Convergence
with probability 1 was proved by several researchers at about the same time (Peng, 1993; Dayan
and Sejnowski, 1994; Tsitsiklis, 1994; Gurvits, Lin, and Hanson, 1994). In addition, Jaakkola,
Jordan, and Singh (1994) proved convergence under on-line updating. All of these results
assumed linearly independent feature vectors, which implies at least as many components to
$\mathbf{w}_t$ as there are states. Convergence for the more important case of general (dependent) feature
vectors was first shown by Dayan (1992). A significant generalization and strengthening of 
Dayan’s result was proved by Tsitsiklis and Van Roy (1997). They proved the main result 
presented in this section, the bound on the asymptotic error of linear bootstrapping methods. 

9.5 Our presentation of the range of possibilities for linear function approximation is based on that 
by Barto (1990). 

9.5.2 Konidaris, Osentoski, and Thomas (2011) introduced the Fourier basis in a simple form suitable
for reinforcement learning problems with multi-dimensional continuous state spaces and
functions that do not have to be periodic. 

9.5.3 The term coarse coding is due to Hinton (1984), and our Figure 9.6 is based on one of his 
figures. Waltz and Fu (1965) provide an early example of this type of function approximation 
in a reinforcement learning system. 

9.5.4 Tile coding, including hashing, was introduced by Albus (1971, 1981). He described it in terms 
of his “cerebellar model articulator controller,” or CMAC, as tile coding is sometimes known in 
the literature. The term “tile coding” was new to the first edition of this book, though the idea 
of describing CMAC in these terms is taken from Watkins (1989). Tile coding has been used 
in many reinforcement learning systems (e.g., Shewchuk and Dean, 1990; Lin and Kim, 1991; 
Miller, Scalera, and Kim, 1994; Sofge and White, 1992; Tham, 1994; Sutton, 1996; Watkins,
1989) as well as in other types of learning control systems (e.g., Kraft and Campagna, 1990; 
Kraft, Miller, and Dietz, 1992). This section draws heavily on the work of Miller and Glanz 
(1996). 

9.5.5 Function approximation using radial basis functions has received wide attention ever since being 
related to neural networks by Broomhead and Lowe (1988). Powell (1987) reviewed earlier uses 
of RBFs, and Poggio and Girosi (1989, 1990) extensively developed and applied this approach. 

9.6 The introduction of the threshold logic unit as an abstract model neuron by McCulloch and 
Pitts (1943) was the beginning of artificial neural networks (ANNs). The history of ANNs as 
learning methods for classification or regression has passed through several stages: roughly, the
Perceptron (Rosenblatt, 1962) and ADALINE (ADAptive LINear Element) (Widrow and Hoff,
1960) stage of learning by single-layer ANNs, the error-backpropagation stage (Werbos, 1974; 
LeCun, 1985; Parker, 1985; Rumelhart, Hinton, and Williams, 1986) of learning by multi-layer 
ANNs, and the current deep-learning stage with its emphasis on representation learning (e.g., 
Bengio, Courville, and Vincent, 2012; Goodfellow, Bengio, and Courville, 2016). Examples of 
the many books on ANNs are Haykin (1994), Bishop (1995), and Ripley (2007). 

The use of ANNs for function approximation in reinforcement learning goes back to the early neural
networks of Farley and Clark (1954), who used reinforcement-like learning to modify the weights of
linear threshold functions representing policies. Widrow, Gupta, and Maitra (1973) presented 
a neuron-like linear threshold unit implementing a learning process they called learning with 
a critic or selective bootstrap adaptation, a reinforcement-learning variant of the ADALINE 
algorithm. Werbos (1974, 1987, 1994) developed an approach to prediction and control that 
uses ANNs trained by error backpropagation to learn policies and value functions using TD-like
algorithms. Barto, Sutton, and Brouwer (1981) and Barto and Sutton (1981b) extended the 
idea of an associative memory network (e.g., Kohonen, 1977; Anderson, Silverstein, Ritz, and 
Jones, 1977) to reinforcement learning. Barto, Anderson, and Sutton (1982) used a two-layer 
ANN to learn a nonlinear control policy, and emphasized the first layer’s role of learning a 
suitable representation. Hampson (1983, 1989) was an early proponent of multilayer ANNs 
for learning value functions. Barto, Sutton, and Anderson (1983) presented an actor-critic 
algorithm in the form of an ANN learning to balance a simulated pole (see Sections 15.7 and 
15.8). Barto and Anandan (1985) introduced a stochastic version of Widrow et al.’s (1973) 
selective bootstrap algorithm called the associative reward-penalty ($A_{R-P}$) algorithm. Barto
(1985, 1986) and Barto and Jordan (1987) described multi-layer ANNs consisting of $A_{R-P}$
units trained with a globally-broadcast reinforcement signal to learn classification rules that 
are not linearly separable. Barto (1985) discussed this approach to ANNs and how this type 
of learning rule is related to others in the literature at that time. (See Section 15.10 for additional
discussion of this approach to training multi-layer ANNs.) Anderson (1986, 1987, 1989)
evaluated numerous methods for training multilayer ANNs and showed that an actor-critic 
algorithm in which both the actor and critic were implemented by two-layer ANNs trained 
by error backpropagation outperformed single-layer ANNs in the pole-balancing and tower of 
Hanoi tasks. Williams (1988) described several ways that backpropagation and reinforcement 
learning can be combined for training ANNs. Gullapalli (1990) and Williams (1992) devised 
reinforcement learning algorithms for neuron-like units having continuous, rather than binary, 
outputs. Barto, Sutton, and Watkins (1990) argued that ANNs can play significant roles for 
approximating functions required for solving sequential decision problems. Williams (1992) 
related REINFORCE learning rules (Section 13.3) to the error backpropagation method for 
training multi-layer ANNs. Tesauro’s TD-Gammon (Tesauro 1992, 1994; Section 16.1) influentially
demonstrated the learning abilities of the TD($\lambda$) algorithm with function approximation by
multi-layer ANNs in learning to play backgammon. The AlphaGo and AlphaGo Zero programs 
of Silver et al. (2016, 2017; Section 16.6) used reinforcement learning with deep convolutional 
ANNs in achieving impressive results with the game of Go. Schmidhuber (2015) reviews applications
of ANNs in reinforcement learning, including applications of recurrent ANNs.

9.7 LSTD is due to Bradtke and Barto (see Bradtke, 1993, 1994; Bradtke and Barto, 1996; Bradtke, 
Ydstie, and Barto, 1994), and was further developed by Boyan (1999, 2002) and Nedic and 
Bertsekas (2003). The incremental update of the inverse matrix has been known at least since 
1949 (Sherman and Morrison, 1949). An extension of least-squares methods to control was 
introduced by Lagoudakis and Parr (2003). 

9.8 Our discussion of memory-based function approximation is largely based on the review of 
locally weighted learning by Atkeson, Moore, and Schaal (1997). Atkeson (1992) discussed the 
use of locally weighted regression in memory-based robot learning and supplied an extensive
bibliography covering the history of the idea. Stanfill and Waltz (1986) influentially argued for
the importance of memory-based methods in artificial intelligence, especially in light of parallel
architectures then becoming available, such as the Connection Machine. Baird and Klopf (1993) 
introduced a novel memory-based approach and used it as the function approximation method 
for Q-learning applied to the pole-balancing task. Schaal and Atkeson (1994) applied locally 
weighted regression to a robot juggling control problem, where it was used to learn a system 
model. Peng (1995) used the pole-balancing task to experiment with several nearest-neighbor
methods for approximating value functions, policies, and environment models. Tadepalli and 
Ok (1996) obtained promising results with locally-weighted linear regression to learn a value 
function for a simulated automatic guided vehicle task. Bottou and Vapnik (1996) demonstrated 
surprising efficiency of several local learning algorithms compared to non-local algorithms in 
some pattern recognition tasks, discussing the impact of local learning on generalization. 

Bentley (1975) introduced k-d trees and reported observing average running time of $O(\log n)$
for nearest neighbor search over n records. Friedman, Bentley, and Finkel (1977) clarified the
algorithm for nearest neighbor search with k-d trees. Omohundro (1987) discussed efficiency
gains possible with hierarchical data structures such as k-d trees. Moore, Schneider, and Deng
(1997) introduced the use of k-d trees for efficient locally weighted regression. 

9.9 The origin of kernel regression is the method of potential functions of Aizerman, Braverman, 
and Rozonoer (1964). They likened the data to point electric charges of various signs and 
magnitudes distributed over space. The resulting electric potential over space produced by 
summing the potentials of the point charges corresponded to the interpolated surface. In this 
analogy, the kernel function is the potential of a point charge, which falls off as the reciprocal 
of the distance from the charge. Connell and Utgoff (1987) applied an actor-critic method 
to the pole-balancing task in which the critic approximated the value function using kernel 
regression with an inverse-distance weighting. Predating widespread interest in kernel regression 
in machine learning, these authors did not use the term kernel, but referred to “Shepard’s 
method” (Shepard, 1968). Other kernel-based approaches to reinforcement learning include 
those of Ormoneit and Sen (2002), Dietterich and Wang (2002), Xu, Xie, Hu, Nu, and Lu 
(2005), Taylor and Parr (2009), Barreto, Precup, and Pineau (2011), and Bhat, Farias, and 
Moallemi (2012). 

9.10 See the bibliographical notes for Emphatic-TD methods in Section 11.8. 


The earliest example we know of in which function approximation methods were used for learning 
value functions was Samuel’s checkers player (1959, 1967). Samuel followed Shannon’s (1950) suggestion 
that a value function did not have to be exact to be a useful guide to selecting moves in a game and that it 
might be approximated by linear combination of features. In addition to linear function approximation, 
Samuel experimented with lookup tables and hierarchical lookup tables called signature tables (Griffith, 
1966, 1974; Page, 1977; Biermann, Fairfield, and Beres, 1982). 

At about the same time as Samuel’s work, Bellman and Dreyfus (1959) proposed using function 
approximation methods with DP. (It is tempting to think that Bellman and Samuel had some influence 
on one another, but we know of no reference to the other in the work of either.) There is now a 
fairly extensive literature on function approximation methods and DP, such as multigrid methods and 
methods using splines and orthogonal polynomials (e.g., Bellman and Dreyfus, 1959; Bellman, Kalaba, 
and Kotkin, 1973; Daniel, 1976; Whitt, 1978; Reetz, 1977; Schweitzer and Seidmann, 1985; Chow and 
Tsitsiklis, 1991; Kushner and Dupuis, 1992; Rust, 1996). 

Holland’s (1986) classifier system used a selective feature-match technique to generalize evaluation 
information across state-action pairs. Each classifier matched a subset of states having specified values 
for a subset of features, with the remaining features having arbitrary values (“wild cards”). These
subsets were then used in a conventional state-aggregation approach to function approximation. Holland’s
idea was to use a genetic algorithm to evolve a set of classifiers that collectively would implement
a useful action-value function. Holland’s ideas influenced the early research of the authors on reinforcement
learning, but we focused on different approaches to function approximation. As function
approximators, classifiers are limited in several ways. First, they are state-aggregation methods, with 
concomitant limitations in scaling and in representing smooth functions efficiently. In addition, the 
matching rules of classifiers can implement only aggregation boundaries that are parallel to the feature 
axes. Perhaps the most important limitation of conventional classifier systems is that the classifiers 
are learned via the genetic algorithm, an evolutionary method. As we discussed in Chapter 1, there 
is available during learning much more detailed information about how to learn than can be used by 
evolutionary methods. This perspective led us to instead adapt supervised learning methods for use 
in reinforcement learning, specifically gradient-descent and neural network methods. These differences 
between Holland’s approach and ours are not surprising because Holland’s ideas were developed during 
a period when neural networks were generally regarded as being too weak in computational power to 
be useful, whereas our work was at the beginning of the period that saw widespread questioning of 
that conventional wisdom. There remain many opportunities for combining aspects of these different 
approaches. 

Christensen and Korf (1986) experimented with regression methods for modifying coefficients of linear 
value function approximations in the game of chess. Chapman and Kaelbling (1991) and Tan (1991) 
adapted decision-tree methods for learning value functions. Explanation-based learning methods have 
also been adapted for learning value functions, yielding compact representations (Yee, Saxena, Utgoff, 
and Barto, 1990; Dietterich and Flann, 1995). 



Chapter 10 


On-policy Control with 
Approximation 


In this chapter we return to the control problem, now with parametric approximation of the action-value
function $\hat{q}(s, a, \mathbf{w}) \approx q_*(s, a)$, where $\mathbf{w} \in \mathbb{R}^d$ is a finite-dimensional weight vector. We continue to
restrict attention to the on-policy case, leaving off-policy methods to Chapter 11. The present chapter 
features the semi-gradient Sarsa algorithm, the natural extension of semi-gradient TD(0) (last chapter) 
to action values and to on-policy control. In the episodic case, the extension is straightforward, but in 
the continuing case we have to take a few steps backward and re-examine how we have used discounting 
to define an optimal policy. Surprisingly, once we have genuine function approximation we have to give 
up discounting and switch to a new “average-reward” formulation of the control problem, with new 
“differential” value functions. 

Starting first in the episodic case, we extend the function approximation ideas presented in the last 
chapter from state values to action values. Then we extend them to control following the general pattern 
of on-policy GPI, using $\varepsilon$-greedy for action selection. We show results for n-step linear Sarsa on the
Mountain Car problem. Then we turn to the continuing case and repeat the development of these ideas 
for the average-reward case with differential values. 


10.1 Episodic Semi-gradient Control 


The extension of the semi-gradient prediction methods of Chapter 9 to action values is straightforward. 
In this case it is the approximate action-value function, $\hat{q} \approx q_\pi$, that is represented as a parameterized
functional form with weight vector $\mathbf{w}$. Whereas before we considered random training examples of the
form $S_t \mapsto U_t$, now we consider examples of the form $S_t, A_t \mapsto U_t$. The update target $U_t$ can be any
approximation of $q_\pi(S_t, A_t)$, including the usual backed-up values such as the full Monte Carlo return,
$G_t$, or any of the n-step Sarsa returns (7.4). The general gradient-descent update for action-value
prediction is

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[ U_t - \hat{q}(S_t, A_t, \mathbf{w}_t) \right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t). \tag{10.1}$$

For example, the update for the one-step Sarsa method is

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[ R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t). \tag{10.2}$$


We call this method episodic semi-gradient one-step Sarsa. For a constant policy, this method converges 
in the same way that TD(0) does, with the same kind of error bound (9.14). 
To form control methods, we need to couple such action-value prediction methods with techniques
for policy improvement and action selection. Suitable techniques applicable to continuous actions, or
to actions from large discrete sets, are a topic of ongoing research with as yet no clear resolution.
On the other hand, if the action set is discrete and not too large, then we can use the techniques
already developed in previous chapters. That is, for each possible action $a$ available in the current state
$S_t$, we can compute $\hat{q}(S_t, a, \mathbf{w}_t)$ and then find the greedy action $A^*_t = \arg\max_a \hat{q}(S_t, a, \mathbf{w}_t)$. Policy
improvement is then done (in the on-policy case treated in this chapter) by changing the estimation
policy to a soft approximation of the greedy policy such as the $\varepsilon$-greedy policy. Actions are selected
according to this same policy. Pseudocode for the complete algorithm is given in the box. 


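Episodic Semi-gradient Sarsa for Estimating $\hat{q} \approx q_*$

Input: a differentiable function $\hat{q} : \mathcal{S} \times \mathcal{A} \times \mathbb{R}^d \to \mathbb{R}$
Initialize value-function weights $\mathbf{w} \in \mathbb{R}^d$ arbitrarily (e.g., $\mathbf{w} = \mathbf{0}$)
Repeat (for each episode):
    $S, A \leftarrow$ initial state and action of episode (e.g., $\varepsilon$-greedy)
    Repeat (for each step of episode):
        Take action $A$, observe $R, S'$
        If $S'$ is terminal:
            $\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ R - \hat{q}(S, A, \mathbf{w}) \right] \nabla \hat{q}(S, A, \mathbf{w})$
            Go to next episode
        Choose $A'$ as a function of $\hat{q}(S', \cdot, \mathbf{w})$ (e.g., $\varepsilon$-greedy)
        $\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ R + \gamma \hat{q}(S', A', \mathbf{w}) - \hat{q}(S, A, \mathbf{w}) \right] \nabla \hat{q}(S, A, \mathbf{w})$
        $S \leftarrow S'$
        $A \leftarrow A'$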
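
The boxed algorithm can be sketched in Python for a linear $\hat{q}$, whose gradient is simply the feature vector. Everything about the environment here is an assumption made for illustration: a five-state corridor with a reward of -1 per step and a terminal state at the right end, one-hot features over state-action pairs, and hypothetical function names. Because all rewards are negative, zero initial weights are optimistic, so the demonstration run explores even with $\varepsilon = 0$, which also makes it deterministic.

```python
import numpy as np

N_STATES, ACTIONS = 5, (-1, +1)   # corridor; state N_STATES - 1 is terminal

def features(s, a_idx):
    """One-hot feature vector x(s, a) (a hypothetical choice of features)."""
    x = np.zeros(N_STATES * len(ACTIONS))
    x[s * len(ACTIONS) + a_idx] = 1.0
    return x

def q_hat(s, a_idx, w):
    """Linear action-value estimate w^T x(s, a)."""
    return float(w @ features(s, a_idx))

def epsilon_greedy(s, w, epsilon, rng):
    if rng.random() < epsilon:
        return int(rng.integers(len(ACTIONS)))
    # np.argmax returns the first maximizer, so ties go to the lower index.
    return int(np.argmax([q_hat(s, a, w) for a in range(len(ACTIONS))]))

def step(s, a_idx):
    """Move left or right (walls clip the move); reward is -1 on every step."""
    s_next = min(max(s + ACTIONS[a_idx], 0), N_STATES - 1)
    return -1.0, s_next, s_next == N_STATES - 1

def semi_gradient_sarsa(episodes, alpha, gamma, epsilon, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(N_STATES * len(ACTIONS))
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(s, w, epsilon, rng)
        while True:
            r, s_next, done = step(s, a)
            grad = features(s, a)   # gradient of a linear q_hat is x(s, a)
            if done:
                w += alpha * (r - q_hat(s, a, w)) * grad
                break
            a_next = epsilon_greedy(s_next, w, epsilon, rng)
            target = r + gamma * q_hat(s_next, a_next, w)
            w += alpha * (target - q_hat(s, a, w)) * grad
            s, a = s_next, a_next
    return w

w = semi_gradient_sarsa(episodes=1000, alpha=0.5, gamma=1.0, epsilon=0.0)
greedy = [int(np.argmax([q_hat(s, a, w) for a in range(len(ACTIONS))]))
          for s in range(N_STATES - 1)]
```

After training, the greedy action in every non-terminal state should be index 1 (move right), and the estimate for moving right from the start state should be close to -4, the negative of the number of steps to the goal.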


Example 10.1: Mountain Car Task Consider the task of driving an underpowered car up a steep 
mountain road, as suggested by the diagram in the upper left of Figure 10.1. The difficulty is that 
gravity is stronger than the car’s engine, and even at full throttle the car cannot accelerate up the steep 
slope. The only solution is to first move away from the goal and up the opposite slope on the left. 
Then, by applying full throttle the car can build up enough inertia to carry it up the steep slope even 
though it is slowing down the whole way. This is a simple example of a continuous control task where 
things have to get worse in a sense (farther from the goal) before they can get better. Many control 
methodologies have great difficulties with tasks of this kind unless explicitly aided by a human designer. 

The reward in this problem is $-1$ on all time steps until the car moves past its goal position at the
top of the mountain, which ends the episode. There are three possible actions: full throttle forward
($+1$), full throttle reverse ($-1$), and zero throttle ($0$). The car moves according to a simplified physics.
Its position, $x_t$, and velocity, $\dot{x}_t$, are updated by

$$x_{t+1} = \text{bound}\left[ x_t + \dot{x}_{t+1} \right]$$

$$\dot{x}_{t+1} = \text{bound}\left[ \dot{x}_t + 0.001 A_t - 0.0025 \cos(3 x_t) \right],$$

where the bound operation enforces $-1.2 \le x_{t+1} \le 0.5$ and $-0.07 \le \dot{x}_{t+1} \le 0.07$. In addition, when
$x_{t+1}$ reached the left bound, $\dot{x}_{t+1}$ was reset to zero. When it reached the right bound, the goal was
reached and the episode was terminated. Each episode started from a random position $x_t \in [-0.6, -0.4)$
and zero velocity. To convert the two continuous state variables to binary features, we used grid-tilings
as in Figure 9.9. We used 8 tilings, with each tile covering 1/8th of the bounded distance in each
dimension, and asymmetrical offsets as described in Section 9.5.4 . 1 The feature vectors x(s,a) created 


1 In particular, we used the tile-coding software, available on the web, version 3 (Python), with iht=IHT(4096) and

Figure 10.1: The Mountain Car task (upper left panel) and the cost-to-go function ($-\max_a \hat{q}(s, a, \mathbf{w})$) learned
during one run.


by tile coding were then combined linearly with the parameter vector to approximate the action-value
function:

$$\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s, a) = \sum_i w_i \cdot x_i(s, a), \tag{10.3}$$

for each pair of state, $s$, and action, $a$.
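The simplified physics and the linear form (10.3) can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used for the experiments: the names `mountain_car_step` and `q_hat`, and the layout of the returned tuple, are assumptions introduced here. Because tile-coded features are binary, the inner product in (10.3) reduces to summing the weights at the active tile indices.

```python
import math

X_MIN, X_MAX = -1.2, 0.5      # position bounds
V_MIN, V_MAX = -0.07, 0.07    # velocity bounds

def mountain_car_step(x, xdot, action):
    """One step of the simplified physics; action is +1, 0, or -1.

    Returns (x_next, xdot_next, reward, done). Hypothetical helper.
    """
    xdot_next = xdot + 0.001 * action - 0.0025 * math.cos(3 * x)
    xdot_next = min(max(xdot_next, V_MIN), V_MAX)    # bound the velocity
    x_next = min(max(x + xdot_next, X_MIN), X_MAX)   # bound the position
    if x_next == X_MIN:
        xdot_next = 0.0            # velocity is reset at the left bound
    done = x_next == X_MAX         # reaching the right bound ends the episode
    return x_next, xdot_next, -1.0, done

def q_hat(w, active_indices):
    """Linear action value for binary features: sum of the active weights."""
    return sum(w[i] for i in active_indices)
```

The list `active_indices` stands for whatever indices a tile coder returns for a given state and action; how those indices are produced is left to the tile-coding software.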

Figure 10.1 shows what typically happens while learning to solve this task with this form of function
approximation.² Shown is the negative of the value function (the cost-to-go function) learned on a
single run. The initial action values were all zero, which was optimistic (all true values are negative
in this task), causing extensive exploration to occur even though the exploration parameter, $\varepsilon$, was 0.
This can be seen in the middle-top panel of the figure, labeled “Step 428”. At this time not even one
episode had been completed, but the car had oscillated back and forth in the valley, following circular
trajectories in state space. All the states visited frequently are valued worse than unexplored states,
because the actual rewards have been worse than what was (unrealistically) expected. This continually
drives the agent away from wherever it has been, to explore new states, until a solution is found.

Figure 10.2 shows several learning curves for semi-gradient Sarsa on this problem, with
various step sizes.

Exercise 10.1 Why have we not considered Monte Carlo methods in this chapter? □ 


tiles(iht, 8, [8*x/(0.5+1.2), 8*xdot/(0.07+0.07)] , A) to get the indices of the ones in the feature vector for state 
(x, xdot) and action A. 

2 This data is actually from the “semi-gradient Sarsa($\lambda$)” algorithm that we will not meet until Chapter 12, but
semi-gradient Sarsa behaves similarly.

[Figure 10.2 plot: Mountain Car; vertical axis: steps per episode (log scale), averaged over 100 runs.]

Figure 10.2: Mountain Car learning curves for the semi-gradient Sarsa method with tile-coding function
approximation and $\varepsilon$-greedy action selection. ■


10.2 n-step Semi-gradient Sarsa 

We can obtain an n-step version of episodic semi-gradient Sarsa by using an n-step return as the update
target in the semi-gradient Sarsa update equation (10.1). The n-step return immediately generalizes
from its tabular form (7.4) to a function approximation form:

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1}), \quad n \ge 1,\ 0 \le t < T - n, \tag{10.4}$$

with $G_{t:t+n} = G_t$ if $t + n \ge T$, as usual. The n-step update equation is

$$\mathbf{w}_{t+n} = \mathbf{w}_{t+n-1} + \alpha \big[G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1})\big] \nabla \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1}), \quad 0 \le t < T. \tag{10.5}$$

As we have seen before, performance is best if an intermediate level of bootstrapping is used, corresponding
to an $n$ larger than 1. Figure 10.3 shows how this algorithm tends to learn faster and obtain
a better asymptotic performance at $n = 8$ than at $n = 1$ on the Mountain Car task. Figure 10.4 shows
the results of a more detailed study of the effect of the parameters $\alpha$ and $n$ on the rate of learning on
this task.
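As a concrete reading of the n-step return (10.4) and the update (10.5), the following Python sketch computes the return from a list of rewards and applies one semi-gradient step for a linear approximator, whose gradient is simply the feature vector. The function names and argument conventions are assumptions made for illustration only.

```python
def n_step_return(rewards, gamma, bootstrap_value=0.0):
    """R_{t+1} + gamma*R_{t+2} + ... + gamma^(n-1)*R_{t+n} + gamma^n * bootstrap.

    rewards: [R_{t+1}, ..., R_{t+n}].
    bootstrap_value: q_hat(S_{t+n}, A_{t+n}, w), or 0.0 if the episode
    terminated within these steps (so the return is the full return G_t).
    """
    g = bootstrap_value
    for r in reversed(rewards):   # accumulate backward through the rewards
        g = r + gamma * g
    return g

def semi_gradient_update(w, x, target, alpha):
    """One step of (10.5) for linear q_hat = w . x, whose gradient is x."""
    q = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + alpha * (target - q) * xi for wi, xi in zip(w, x)]
```

For example, three rewards of $-1$ with $\gamma = 1$ and a bootstrap value of $-5$ give a return of $-8$.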


[Figure 10.3 plot: Mountain Car; vertical axis: steps per episode (log scale), averaged over 100 runs.]

Figure 10.3: One-step vs multi-step performance of n-step semi-gradient Sarsa on the Mountain Car task.
Good step sizes were used: $\alpha = 0.5/8$ for $n = 1$ and $\alpha = 0.3/8$ for $n = 8$.


[Figure 10.4 plot: Mountain Car; vertical axis: steps per episode, averaged over the first 50 episodes and 100 runs.]

Figure 10.4: Effect of $\alpha$ and $n$ on early performance of n-step semi-gradient Sarsa with tile-coding function
approximation on the Mountain Car task. As usual, an intermediate level of bootstrapping ($n = 4$) performed
best. These results are for selected $\alpha$ values, on a log scale, and then connected by straight lines. The standard
errors ranged from 0.5 (less than the line width) for $n = 1$ to about 4 for $n = 16$, so the main effects are all
statistically significant.


Exercise 10.2 Give pseudocode for semi-gradient one-step Expected Sarsa for control. □ 

Exercise 10.3 Why do the results shown in Figure 10.4 have higher standard errors at large n than
at low n? □

10.3 Average Reward: A New Problem Setting for Continuing Tasks

We now introduce a third classical setting, alongside the episodic and discounted settings, for formulating
the goal in Markov decision problems (MDPs). Like the discounted setting, the average reward
setting applies to continuing problems, problems for which the interaction between agent and environment
goes on and on forever without termination or start states. Unlike that setting, however, there is
no discounting: the agent cares just as much about delayed rewards as it does about immediate reward.
The average-reward setting is one of the major settings considered in the classical theory of dynamic
programming and, though less often, in reinforcement learning. As we discuss in the next section, the
discounted setting is problematic with function approximation, and thus the average-reward setting is
needed to replace it.

In the average-reward setting, the quality of a policy $\pi$ is defined as the average rate of reward while
following that policy, which we denote as $r(\pi)$:

$$\begin{aligned}
r(\pi) &= \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\big[R_t \mid A_{0:t-1} \sim \pi\big] \\
&= \lim_{t \to \infty} \mathbb{E}\big[R_t \mid A_{0:t-1} \sim \pi\big] \\
&= \sum_s \mu_\pi(s) \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)\, r,
\end{aligned} \tag{10.6}$$

where the expectations are conditioned on the prior actions, $A_0, A_1, \ldots, A_{t-1}$, being taken according to
$\pi$, and $\mu_\pi$ is the steady-state distribution, $\mu_\pi(s) = \lim_{t \to \infty} \Pr\{S_t = s \mid A_{0:t-1} \sim \pi\}$, which is assumed
to exist and to be independent of $S_0$. This property is known as ergodicity. It means that where the
MDP starts or any early decision made by the agent can have only a temporary effect; in the long run
your expectation of being in a state depends only on the policy and the MDP transition probabilities.
Ergodicity is sufficient to guarantee the existence of the limits in the equations above.

There are subtle distinctions that can be drawn between different kinds of optimality in the undiscounted
continuing case. Nevertheless, for most practical purposes it may be adequate simply to order
policies according to their average reward per time step, in other words, according to their $r(\pi)$. This
quantity is essentially the average reward under $\pi$, as suggested by (10.6). In particular, we consider
all policies that attain the maximal value of $r(\pi)$ to be optimal.

Note that the steady state distribution is the special distribution under which, if you select actions
according to $\pi$, you remain in the same distribution. That is, for which

$$\sum_s \mu_\pi(s) \sum_a \pi(a|s)\, p(s'|s,a) = \mu_\pi(s'). \tag{10.7}$$
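To make the steady-state distribution and the average reward concrete, here is a small numerical sketch. The two-state chain and its rewards are hypothetical, chosen only for illustration; `P[s][s2]` is assumed to be the state-transition probability already averaged over the policy's action choices, and `r_state[s]` the expected immediate reward from state `s`. The steady-state distribution is approximated by repeatedly applying the transition matrix, which relies on the ergodicity assumption.

```python
def steady_state(P, iters=1000):
    """Approximate mu satisfying mu = mu P by repeated application of P."""
    n = len(P)
    mu = [1.0 / n] * n                       # any starting distribution works
    for _ in range(iters):
        mu = [sum(mu[s] * P[s][s2] for s in range(n)) for s2 in range(n)]
    return mu

def average_reward(P, r_state):
    """r(pi) as the mu-weighted expected immediate reward (last line of 10.6)."""
    mu = steady_state(P)
    return sum(m * r for m, r in zip(mu, r_state))

# Hypothetical two-state chain induced by some fixed policy.
P = [[0.9, 0.1],
     [0.5, 0.5]]
r_state = [1.0, 0.0]

mu = steady_state(P)
r_pi = average_reward(P, r_state)
```

For this chain the balance condition $0.1\,\mu(0) = 0.5\,\mu(1)$ gives $\mu = (5/6, 1/6)$, so $r(\pi) = 5/6$.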

In the average-reward setting, returns are defined in terms of differences between rewards and the
average reward:

$$G_t = R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + \cdots. \tag{10.8}$$

This is known as the differential return, and the corresponding value functions are known as differential
value functions. They are defined in the same way and we will use the same notation for them as we
have all along: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$ and $q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$ (similarly for $v_*$ and $q_*$).
Differential value functions also have Bellman equations, just slightly different from those we have seen
earlier. We simply remove all $\gamma$s and replace all rewards by the difference between the reward and the
true average reward:

$$v_\pi(s) = \sum_a \pi(a|s) \sum_{r,s'} p(s',r|s,a) \Big[r - r(\pi) + v_\pi(s')\Big],$$

$$q_\pi(s,a) = \sum_{r,s'} p(s',r|s,a) \Big[r - r(\pi) + \sum_{a'} \pi(a'|s')\, q_\pi(s',a')\Big],$$

$$v_*(s) = \max_a \sum_{r,s'} p(s',r|s,a) \Big[r - \max_\pi r(\pi) + v_*(s')\Big], \text{ and}$$

$$q_*(s,a) = \sum_{r,s'} p(s',r|s,a) \Big[r - \max_\pi r(\pi) + \max_{a'} q_*(s',a')\Big]$$

(cf. Eqs. 3.14, 4.1, and 4.2).

There is also a differential form of the two TD errors:

$$\delta_t = R_{t+1} - \bar{R}_{t+1} + \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t), \text{ and} \tag{10.9}$$

$$\delta_t = R_{t+1} - \bar{R}_{t+1} + \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t), \tag{10.10}$$

where $\bar{R}_t$ is an estimate at time $t$ of the average reward $r(\pi)$. With these alternate definitions, most of
our algorithms and many theoretical results carry through to the average-reward setting.

For example, the average reward version of semi-gradient Sarsa is defined just as in (10.2) except
with the differential version of the TD error. That is, by

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \delta_t \nabla \hat{q}(S_t, A_t, \mathbf{w}_t), \tag{10.11}$$

with $\delta_t$ given by (10.10). The pseudocode for the complete algorithm is given in the box.


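Differential semi-gradient Sarsa for estimating $\hat{q} \approx q_*$

Input: a differentiable function $\hat{q} : \mathcal{S} \times \mathcal{A} \times \mathbb{R}^m \to \mathbb{R}$
Parameters: step sizes $\alpha, \beta > 0$

Initialize value-function weights $\mathbf{w} \in \mathbb{R}^m$ arbitrarily (e.g., $\mathbf{w} = \mathbf{0}$)
Initialize average reward estimate $\bar{R}$ arbitrarily (e.g., $\bar{R} = 0$)
Initialize state $S$, and action $A$

Repeat (for each step):
    Take action $A$, observe $R$, $S'$
    Choose $A'$ as a function of $\hat{q}(S', \cdot, \mathbf{w})$ (e.g., $\varepsilon$-greedy)
    $\delta \leftarrow R - \bar{R} + \hat{q}(S', A', \mathbf{w}) - \hat{q}(S, A, \mathbf{w})$
    $\bar{R} \leftarrow \bar{R} + \beta \delta$
    $\mathbf{w} \leftarrow \mathbf{w} + \alpha \delta \nabla \hat{q}(S, A, \mathbf{w})$
    $S \leftarrow S'$
    $A \leftarrow A'$

Example 10.2: An Access-Control Queuing Task This is a decision task involving access control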
to a set of k servers. Customers of four different priorities arrive at a single queue. If given access to a 
server, the customers pay a reward of 1, 2, 4, or 8 to the server, depending on their priority, with higher 
priority customers paying more. In each time step, the customer at the head of the queue is either 
accepted (assigned to one of the servers) or rejected (removed from the queue, with a reward of zero). 
In either case, on the next time step the next customer in the queue is considered. The queue never 
empties, and the priorities of the customers in the queue are equally randomly distributed. Of course a 
customer cannot be served if there is no free server; the customer is always rejected in this case. Each 
busy server becomes free with probability p on each time step. Although we have just described them 
for definiteness, let us assume the statistics of arrivals and departures are unknown. The task is to 
decide on each step whether to accept or reject the next customer, on the basis of his priority and the 
number of free servers, so as to maximize long-term reward without discounting. 

In this example we consider a tabular solution to this problem. Although there is no generalization 
between states, we can still consider it in the general function approximation setting as this setting 
generalizes the tabular setting. Thus we have a differential action-value estimate for each pair of state 
(number of free servers and priority of the customer at the head of the queue) and action (accept or 
reject). Figure 10.5 shows the solution found by differential semi-gradient Sarsa for this task with
$k = 10$ and $p = 0.06$. The algorithm parameters were $\alpha = 0.01$, $\beta = 0.01$, and $\varepsilon = 0.1$. The initial
action values and $\bar{R}$ were zero.
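A tabular version of differential semi-gradient Sarsa on this task might look like the sketch below. It is not the code behind Figure 10.5. Two modeling details are assumptions, because the description leaves them open: servers that were busy before the current decision are the ones that may become free, and ties between action values are broken in favor of rejecting. All function names are hypothetical.

```python
import random

PRIORITIES = [1, 2, 4, 8]     # payment for each customer priority
K, P_FREE = 10, 0.06          # number of servers, per-step freeing probability

def env_step(free, pri, action, rng):
    """Apply accept (1) or reject (0); return (reward, next_free, next_pri)."""
    busy_before = K - free
    reward = 0
    if action == 1 and free > 0:
        reward = PRIORITIES[pri]
        free -= 1
    # Assumption: each server busy before this step frees up with probability p.
    free += sum(rng.random() < P_FREE for _ in range(busy_before))
    return reward, free, rng.randrange(len(PRIORITIES))

def epsilon_greedy(q, free, pri, eps, rng):
    if free == 0:
        return 0              # no free server: the customer is always rejected
    if rng.random() < eps:
        return rng.randrange(2)
    return max((0, 1), key=lambda a: q[free][pri][a])   # ties go to reject

def differential_sarsa(steps, alpha=0.01, beta=0.01, eps=0.1, seed=0):
    rng = random.Random(seed)
    q = [[[0.0, 0.0] for _ in PRIORITIES] for _ in range(K + 1)]
    r_bar = 0.0               # average-reward estimate
    free, pri = K, rng.randrange(len(PRIORITIES))
    a = epsilon_greedy(q, free, pri, eps, rng)
    for _ in range(steps):
        r, free2, pri2 = env_step(free, pri, a, rng)
        a2 = epsilon_greedy(q, free2, pri2, eps, rng)
        delta = r - r_bar + q[free2][pri2][a2] - q[free][pri][a]
        r_bar += beta * delta
        q[free][pri][a] += alpha * delta   # tabular case: gradient is one-hot
        free, pri, a = free2, pri2, a2
    return q, r_bar
```

Calling `differential_sarsa(2_000_000)` would mirror the run length quoted in the example, though the learned values will depend on the assumptions noted above and on the random seed.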

[Figure 10.5 plots: two panels, “Policy” and “Value Function”; axes: priority (1, 2, 4, 8) and number of free servers.]

Figure 10.5: The policy and value function found by differential semi-gradient one-step Sarsa on the access-control
queuing task after 2 million steps. The drop on the right of the graph is probably due to insufficient
data; many of these states were never experienced. The value learned for $\bar{R}$ was about 2.31. ■

10.4 Deprecating the Discounted Setting

The continuing, discounted problem formulation has been very useful in the tabular case, in which the 
returns from each state can be separately identified and averaged. But in the approximate case it is 
questionable whether one should ever use this problem formulation. 

To see why, consider an infinite sequence of returns with no beginning or end, and no clearly identified 
states. The states might be represented only by feature vectors, which may do little to distinguish the 
states from each other. As a special case, all of the feature vectors may be the same. Thus one really 
has only the reward sequence (and the actions), and performance has to be assessed purely from these. 
How could it be done? One way is by averaging the rewards over a long interval—this is the idea of 
the average-reward setting. How could discounting be used? Well, for each time step we could measure 
the discounted return. Some returns would be small and some big, so again we would have to average 
them over a sufficiently large time interval. In the continuing setting there are no starts and ends, and 
no special time steps, so there is nothing else that could be done. However, if you do this, it turns out 
that the average of the discounted returns is proportional to the average reward. In fact, for policy
$\pi$, the average of the discounted returns is always $r(\pi)/(1 - \gamma)$, that is, it is essentially the average
reward, $r(\pi)$. In particular, the ordering of all policies in the average discounted return setting would
be exactly the same as in the average-reward setting. The discount rate $\gamma$ thus has no effect on the
problem formulation. It could in fact be zero and the ranking would be unchanged.

This surprising fact is proven in the box, but the basic idea can be seen via a symmetry argument.
Each time step is exactly the same as every other. With discounting, every reward will appear exactly
once in each position in some return. The $t$th reward will appear undiscounted in the $(t-1)$st return,
discounted once in the $(t-2)$nd return, and discounted 999 times in the $(t-1000)$th return. The weight on
the $t$th reward is thus $1 + \gamma + \gamma^2 + \gamma^3 + \cdots = 1/(1 - \gamma)$. Since all states are the same, they are all weighted
by this, and thus the average of the returns will be this times the average reward, or $r(\pi)/(1 - \gamma)$.

The Futility of Discounting in Continuing Problems

Perhaps discounting can be saved by choosing an objective that sums discounted values over the
distribution with which states occur under the policy:

$$\begin{aligned}
J(\pi) &= \sum_s \mu_\pi(s)\, v_\pi^\gamma(s) && \text{(where $v_\pi^\gamma$ is the discounted value function)} \\
&= \sum_s \mu_\pi(s) \sum_a \pi(a|s) \sum_{s'} \sum_r p(s',r|s,a) \big[r + \gamma v_\pi^\gamma(s')\big] && \text{(Bellman Eq.)} \\
&= r(\pi) + \sum_s \mu_\pi(s) \sum_a \pi(a|s) \sum_{s'} \sum_r p(s',r|s,a)\, \gamma v_\pi^\gamma(s') && \text{(from (10.6))} \\
&= r(\pi) + \gamma \sum_{s'} v_\pi^\gamma(s') \sum_s \mu_\pi(s) \sum_a \pi(a|s)\, p(s'|s,a) && \text{(from (3.4))} \\
&= r(\pi) + \gamma \sum_{s'} v_\pi^\gamma(s')\, \mu_\pi(s') && \text{(from (10.7))} \\
&= r(\pi) + \gamma J(\pi) \\
&= r(\pi) + \gamma r(\pi) + \gamma^2 J(\pi) \\
&= r(\pi) + \gamma r(\pi) + \gamma^2 r(\pi) + \gamma^3 r(\pi) + \cdots \\
&= \frac{1}{1 - \gamma}\, r(\pi).
\end{aligned}$$

The proposed discounted objective orders policies identically to the undiscounted (average reward)
objective. We have failed to save discounting!

So in this key case, which the discounted case was invented for, discounting is not applicable. The
discounted case is still pertinent, or at least possible, for the episodic case.
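The identity $J(\pi) = r(\pi)/(1 - \gamma)$ can be checked numerically on a small example. The sketch below uses a hypothetical two-state chain (transition matrix `P` already averaged over the policy, expected immediate rewards `r`); both are assumptions for illustration. Discounted values are obtained by iterating the Bellman equation, and the steady-state distribution by repeated application of `P`.

```python
def steady_state(P, iters=5000):
    """Approximate mu with mu = mu P, starting from the uniform distribution."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[s] * P[s][s2] for s in range(n)) for s2 in range(n)]
    return mu

def discounted_values(P, r, gamma, iters=5000):
    """Iterate v(s) = r(s) + gamma * sum_s' P[s][s'] v(s') to convergence."""
    n = len(P)
    v = [0.0] * n
    for _ in range(iters):
        v = [r[s] + gamma * sum(P[s][s2] * v[s2] for s2 in range(n))
             for s in range(n)]
    return v

def average_discounted_value(P, r, gamma):
    """J(pi): discounted values weighted by the steady-state distribution."""
    mu = steady_state(P)
    v = discounted_values(P, r, gamma)
    return sum(m * x for m, x in zip(mu, v))

P = [[0.9, 0.1],
     [0.5, 0.5]]          # hypothetical chain
r = [1.0, 0.0]            # hypothetical expected rewards
r_pi = sum(m * x for m, x in zip(steady_state(P), r))

for gamma in (0.0, 0.5, 0.9):
    print(gamma, average_discounted_value(P, r, gamma), r_pi / (1 - gamma))
```

For each value of $\gamma$ the two printed quantities should agree up to floating-point error, so ranking policies by $J(\pi)$ gives the same order as ranking them by $r(\pi)$.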


10.5 n-step Differential Semi-gradient Sarsa 

In order to generalize to n-step bootstrapping, we need an n-step version of the TD error. We begin by
generalizing the n-step return (7.4) to its differential form, with function approximation:

$$G_{t:t+n} = R_{t+1} - \bar{R}_{t+1} + R_{t+2} - \bar{R}_{t+2} + \cdots + R_{t+n} - \bar{R}_{t+n} + \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1}), \tag{10.12}$$

where $\bar{R}$ is an estimate of $r(\pi)$, $n \ge 1$, and $t + n < T$. If $t + n \ge T$, then we define $G_{t:t+n} = G_t$ as
usual. The n-step TD error is then

$$\delta_t = G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}), \tag{10.13}$$

after which we can apply our usual semi-gradient Sarsa update (10.11). Pseudocode for the complete
algorithm is given in the box.


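Differential semi-gradient n-step Sarsa for estimating $\hat{q} \approx q_*$, or $\hat{q} \approx q_\pi$

Input: a differentiable function $\hat{q} : \mathcal{S} \times \mathcal{A} \times \mathbb{R}^m \to \mathbb{R}$, a policy $\pi$
Initialize value-function weights $\mathbf{w} \in \mathbb{R}^m$ arbitrarily (e.g., $\mathbf{w} = \mathbf{0}$)
Initialize average-reward estimate $\bar{R} \in \mathbb{R}$ arbitrarily (e.g., $\bar{R} = 0$)
Parameters: step sizes $\alpha, \beta > 0$, a positive integer $n$
All store and access operations ($S_t$, $A_t$, and $R_t$) can take their index mod $n + 1$

Initialize and store $S_0$ and $A_0$
For $t = 0, 1, 2, \ldots$:
    Take action $A_t$
    Observe and store the next reward as $R_{t+1}$ and the next state as $S_{t+1}$
    Select and store an action $A_{t+1} \sim \pi(\cdot|S_{t+1})$, or $\varepsilon$-greedy wrt $\hat{q}(S_{t+1}, \cdot, \mathbf{w})$
    $\tau \leftarrow t - n + 1$ ($\tau$ is the time whose estimate is being updated)
    If $\tau \ge 0$:
        $\delta \leftarrow \sum_{i=\tau+1}^{\tau+n} (R_i - \bar{R}) + \hat{q}(S_{\tau+n}, A_{\tau+n}, \mathbf{w}) - \hat{q}(S_\tau, A_\tau, \mathbf{w})$
        $\bar{R} \leftarrow \bar{R} + \beta \delta$
        $\mathbf{w} \leftarrow \mathbf{w} + \alpha \delta \nabla \hat{q}(S_\tau, A_\tau, \mathbf{w})$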

10.6 Summary 

In this chapter we have extended the ideas of parameterized function approximation and semi-gradient 
descent, introduced in the previous chapter, to control. The extension is immediate for the episodic case, 
but for the continuing case we have to introduce a whole new problem formulation based on maximizing 
the average reward per time step. Surprisingly, the discounted formulation cannot be carried over to 
control in the presence of approximations. In the approximate case most policies cannot be represented 
by a value function. The arbitrary policies that remain need to be ranked, and the scalar average reward 
$r(\pi)$ provides an effective way to do this.

The average reward formulation involves new differential versions of value functions, Bellman equations,
and TD errors, but all of these parallel the old ones, and the conceptual changes are small. There
is also a new parallel set of differential algorithms for the average-reward case. We illustrate this by
developing differential versions of semi-gradient n-step Sarsa.


Bibliographical and Historical Remarks 

10.1 Semi-gradient Sarsa with function approximation was first explored by Rummery and Niranjan 
(1994). Linear semi-gradient Sarsa with $\varepsilon$-greedy action selection does not converge in the
usual sense, but does enter a bounded region near the best solution (Gordon, 1995). Precup
and Perkins (2003) showed convergence in a differentiable action selection setting. See also
Perkins and Pendrith (2002) and Melo, Meyn, and Ribeiro (2008). The mountain-car example
is based on a similar task studied by Moore (1990), but the exact form used here is from Sutton 
(1996). 

10.2 Episodic n-step semi-gradient Sarsa is based on the forward Sarsa($\lambda$) algorithm of van Seijen
(2016). The empirical results shown here are new to the second edition of this text. 

10.3 The average-reward formulation has been described for dynamic programming (e.g., Puterman, 
1994) and from the point of view of reinforcement learning (Mahadevan, 1996; Tadepalli and 
Ok, 1994; Bertsekas and Tsitsiklis, 1996; Tsitsiklis and Van Roy, 1999). The algorithm described
here is the on-policy analog of the “R-learning” algorithm introduced by Schwartz (1993). The 
name R-learning was probably meant to be the alphabetic successor to Q-learning, but we prefer 
to think of it as a reference to the learning of differential or relative values. The access-control 
queuing example was suggested by the work of Carlström and Nordström (1997).

10.4 The limitations of discounting as a formulation of the reinforcement learning
problem with function approximation became apparent to the authors shortly after the
publication of the first edition of this text. The second edition of this book may be the first 
publication of the demonstration of the futility of discounting in the box on page 205. 

10.5 The differential version of n-step semi-gradient Sarsa is new to this text and has not been
significantly studied.

Chapter 11

*Off-policy Methods with Approximation


This book has treated on-policy and off-policy learning methods since Chapter 5 primarily as two 
alternative ways of handling the conflict between exploitation and exploration inherent in learning forms 
of generalized policy iteration. The two chapters preceding this have treated the on-policy case with 
function approximation, and in this chapter we treat the off-policy case with function approximation.
The extension to function approximation turns out to be significantly different and harder for off-policy 
learning than it is for on-policy learning. The tabular off-policy methods developed in Chapters 6 
and 7 readily extend to semi-gradient algorithms, but these algorithms do not converge as robustly 
as they do under on-policy training. In this chapter we explore the convergence problems, take a 
closer look at the theory of linear function approximation, introduce a notion of learnability, and then 
discuss new algorithms with stronger convergence guarantees for the off-policy case. In the end we will 
have improved methods, but the theoretical results will not be as strong, nor the empirical results as 
satisfying, as they are for on-policy learning. Along the way, we will gain a deeper understanding of 
approximation in reinforcement learning for on-policy learning as well as off-policy learning. 

Recall that in off-policy learning we seek to learn a value function for a target policy $\pi$, given data
due to a different behavior policy $b$. In the prediction case, both policies are static and given, and we
seek to learn either state values $\hat{v} \approx v_\pi$ or action values $\hat{q} \approx q_\pi$. In the control case, action values are
learned, and both policies typically change during learning, with $\pi$ being the greedy policy with respect to
$\hat{q}$, and $b$ being something more exploratory such as the $\varepsilon$-greedy policy with respect to $\hat{q}$.

The challenge of off-policy learning can be divided into two parts, one that arises in the tabular 
case and one that arises only with function approximation. The first part of the challenge has to do 
with the target of the update (not to be confused with the target policy), and the second part has to 
do with the distribution of the updates. The techniques related to importance sampling developed in 
Chapters 5 and 7 deal with the first part; these may increase variance but are needed in all successful 
algorithms, tabular and approximate. The extension of these techniques to function approximation is
quickly dealt with in the first section of this chapter.

Something more is needed for the second part of the challenge of off-policy learning with function 
approximation because the distribution of updates in the off-policy case is not according to the on- 
policy distribution. The on-policy distribution is important to the stability of semi-gradient methods. 
Two general approaches have been explored to deal with this. One is to use importance sampling 
methods again, this time to warp the update distribution back to the on-policy distribution, so that 
semi-gradient methods are guaranteed to converge (in the linear case). The other is to develop true 
gradient methods that do not rely on any special distribution for stability. We present methods based
on both approaches. This is a cutting-edge research area, and it is not clear which of these approaches
is most effective in practice.


11.1 Semi-gradient Methods 


We begin by describing how the methods developed in earlier chapters for the off-policy case extend 
readily to function approximation as semi-gradient methods. These methods address the first part of 
the challenge of off-policy learning (changing the update targets) but not the second part (changing 
the update distribution). Accordingly, these methods may diverge in some cases, and in that sense are 
not sound, but still they are often successfully used. Remember that these methods are guaranteed 
stable and asymptotically unbiased for the tabular case, which corresponds to a special case of function 
approximation. So it may still be possible to combine them with feature selection methods in such a 
way that the combined system could be assured stable. In any event, these methods are simple and 
thus a good place to start. 

In Chapter 7 we described a variety of tabular off-policy algorithms. To convert them to semi-gradient
form, we simply replace the update to an array ($V$ or $Q$) with an update to a weight vector ($\mathbf{w}$), using
the approximate value function ($\hat{v}$ or $\hat{q}$) and its gradient. Many of these algorithms use the per-step
importance sampling ratio:

$$\rho_t = \frac{\pi(A_t|S_t)}{b(A_t|S_t)}. \tag{11.1}$$

For example, the one-step, state-value algorithm is semi-gradient off-policy TD(0), which is just like
the corresponding on-policy algorithm (page 166) except for the addition of $\rho_t$:


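$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \rho_t \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t), \tag{11.2}$$

where $\delta_t$ is defined appropriately depending on whether the problem is episodic and discounted, or
continuing and undiscounted using average reward:

$$\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t), \text{ or} \tag{11.3}$$

$$\delta_t = R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t). \tag{11.4}$$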
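For a linear value function the gradient in (11.2) is just the feature vector, so one episodic update can be written in a few lines. This is a minimal sketch under that linearity assumption; the function names and the use of plain Python lists for vectors are choices made here for illustration.

```python
def importance_ratio(pi_prob, b_prob):
    """Per-step ratio (11.1): target-policy over behavior-policy probability."""
    return pi_prob / b_prob

def off_policy_td0_update(w, x, x_next, reward, rho, alpha, gamma):
    """One episodic semi-gradient off-policy TD(0) step, (11.2) with (11.3).

    w: weights; x, x_next: feature vectors of S_t and S_{t+1}.
    """
    v = sum(wi * xi for wi, xi in zip(w, x))
    v_next = sum(wi * xi for wi, xi in zip(w, x_next))
    delta = reward + gamma * v_next - v          # TD error (11.3)
    return [wi + alpha * rho * delta * xi for wi, xi in zip(w, x)]
```

When the behavior policy takes an action the target policy never would, `rho` is zero and the weights are left unchanged.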

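For action values, the one-step algorithm is semi-gradient Expected Sarsa:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \delta_t \nabla \hat{q}(S_t, A_t, \mathbf{w}_t), \text{ with} \tag{11.5}$$

$$\delta_t = R_{t+1} + \gamma \sum_a \pi(a|S_{t+1})\, \hat{q}(S_{t+1}, a, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t), \text{ or} \quad \text{(episodic)}$$

$$\delta_t = R_{t+1} - \bar{R}_t + \sum_a \pi(a|S_{t+1})\, \hat{q}(S_{t+1}, a, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t). \quad \text{(continuing)}$$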

Note that this algorithm does not use importance sampling. In the tabular case it is clear that this is 
appropriate because the only sample action is A t , and in learning its value we do not have to consider 
any other actions. With function approximation it is less clear because we might want to weight different
state-action pairs differently once they all contribute to the same overall approximation. Proper
resolution of this issue awaits a more thorough understanding of the theory of function approximation 
in reinforcement learning. 
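The two Expected Sarsa TD errors differ only in how the reward term is treated. The sketch below computes both from the target policy's action probabilities at the next state and the corresponding action-value estimates; the function names and list-based arguments are assumptions for illustration.

```python
def expected_q(pi_next, q_next):
    """Expectation of q_hat(S_{t+1}, a, w) under the target policy pi."""
    return sum(p * q for p, q in zip(pi_next, q_next))

def expected_sarsa_delta_episodic(reward, gamma, pi_next, q_next, q_sa):
    """Episodic, discounted TD error."""
    return reward + gamma * expected_q(pi_next, q_next) - q_sa

def expected_sarsa_delta_continuing(reward, r_bar, pi_next, q_next, q_sa):
    """Continuing, average-reward TD error; r_bar estimates r(pi)."""
    return reward - r_bar + expected_q(pi_next, q_next) - q_sa
```

No importance sampling ratio appears because the expectation over next actions is taken under the target policy directly.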





In the multi-step generalizations of these algorithms, both the state-value and action-value algorithms
involve importance sampling. For example, the n-step version of semi-gradient Expected Sarsa is

$$\mathbf{w}_{t+n} = \mathbf{w}_{t+n-1} + \alpha \rho_{t+1} \cdots \rho_{t+n-1} \big[G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1})\big] \nabla \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1}) \tag{11.6}$$

with

$$G_{t:t+n} = R_{t+1} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1}), \text{ or} \quad \text{(episodic)}$$

$$G_{t:t+n} = R_{t+1} - \bar{R}_t + \cdots + R_{t+n} - \bar{R}_{t+n-1} + \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1}), \quad \text{(continuing)}$$

where here we are being slightly informal in our treatment of the ends of episodes. In the first equation,
the $\rho_k$s for $k \ge T$ (where $T$ is the last time step of the episode) should be taken to be 1, and $G_{t:t+n}$
should be taken to be $G_t$ if $t + n \ge T$.

Recall that we also presented in Chapter 7 an off-policy algorithm that does not involve importance
sampling at all: the n-step tree-backup algorithm. Here is its semi-gradient version:

$$\mathbf{w}_{t+n} = \mathbf{w}_{t+n-1} + \alpha \big[G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1})\big] \nabla \hat{q}(S_t, A_t, \mathbf{w}_{t+n-1}), \text{ with} \tag{11.7}$$

$$G_{t:t+n} = \hat{q}(S_t, A_t, \mathbf{w}_{t-1}) + \sum_{k=t}^{t+n-1} \delta_k \prod_{i=t+1}^{k} \gamma \pi(A_i|S_i), \tag{11.8}$$

with $\delta_t$ as defined above for Expected Sarsa. We also defined in Chapter 7 an algorithm
that unifies all action-value algorithms: n-step $Q(\sigma)$. We leave the semi-gradient form of that algorithm,
and also of the n-step state-value algorithm, as exercises for the reader.

Exercise 11.1 Convert the equation of n-step off-policy TD (7.7) to semi-gradient form. Give accompanying
definitions of the return for both the episodic and continuing cases. □

*Exercise 11.2 Convert the equations of n-step $Q(\sigma)$ (7.9, 7.14, 7.16, and 7.17) to semi-gradient form.
Give definitions that cover both the episodic and continuing cases. □


11.2 Examples of Off-policy Divergence 

In this section we begin to discuss the second part of the challenge of off-policy learning with function
approximation: that the distribution of updates does not match the on-policy distribution. We describe
some instructive counterexamples to off-policy learning, cases where semi-gradient and other simple
algorithms are unstable and diverge.

To establish intuitions, it is best to consider first a very simple example. Suppose, perhaps as part
of a larger MDP, there are two states whose estimated values are of the functional form $w$ and $2w$,
where the parameter vector $\mathbf{w}$ consists of only a single component $w$. This occurs under linear function
approximation if the feature vectors for the two states are each simple numbers (single-component vectors),
in this case 1 and 2. In the first state, only one action is available, and it results deterministically
in a transition to the second state with a reward of 0:

[Diagram: a state with estimated value $w$ transitions to a state with estimated value $2w$, with reward 0.]

where the expressions inside the two circles indicate the two states’ values.

Suppose initially $w = 10$. The transition will then be from a state of estimated value 10 to a state
of estimated value 20. It will look like a good transition, and $w$ will be increased to raise the first
state’s estimated value. If $\gamma$ is nearly 1, then the TD error will be nearly 10, and, if $\alpha = 0.1$, then $w$
will be increased to nearly 11 in trying to reduce the TD error. However, the second state’s estimated
value will also be increased, to nearly 22. If the transition occurs again, then it will be from a state
of estimated value $\approx 11$ to a state of estimated value $\approx 22$, for a TD error of $\approx 11$, which is larger, not smaller,
than before. It will look even more like the first state is undervalued, and its value will be increased
again, this time to $\approx 12.1$. This looks bad, and in fact with further updates $w$ will diverge to infinity.

To see this definitively we have to look more carefully at the sequence of updates. The TD error on
a transition between the two states is

$$\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t) = 0 + \gamma 2 w_t - w_t = (2\gamma - 1) w_t,$$

and the off-policy semi-gradient TD(0) update (from (11.2)) is

$$w_{t+1} = w_t + \alpha \rho_t \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t) = w_t + \alpha \cdot 1 \cdot (2\gamma - 1) w_t \cdot 1 = \big(1 + \alpha(2\gamma - 1)\big) w_t.$$

Note that the importance sampling ratio, $\rho_t$, is 1 on this transition because there is only one action
available from the first state, so its probabilities of being taken under the target and behavior policies
must both be 1. In the final update above, the new parameter is the old parameter times a scalar
constant, $1 + \alpha(2\gamma - 1)$. If this constant is greater than 1, then the system is unstable and $w$ will go to
positive or negative infinity depending on its initial value. Here this constant is greater than 1 whenever
$\gamma > 0.5$. Note that stability does not depend on the specific step size, as long as $\alpha > 0$. Smaller or
larger step sizes would affect the rate at which $w$ goes to infinity, but not whether it goes there or not.
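The growth of $w$ under repeated updates on this single transition is easy to simulate. The sketch below applies the update derived above with $\rho_t = 1$ and a gradient of 1 at the first state; the function name `simulate_w` is an assumption made for illustration.

```python
def simulate_w(w0, alpha, gamma, steps):
    """Repeatedly apply semi-gradient TD(0) on the w -> 2w transition."""
    w = w0
    history = [w]
    for _ in range(steps):
        delta = 0.0 + gamma * 2 * w - w        # TD error with reward 0
        w = w + alpha * 1.0 * delta * 1.0      # rho = 1, gradient = 1
        history.append(w)
    return history

diverging = simulate_w(10.0, 0.1, 0.99, 50)    # gamma > 0.5: w grows each step
shrinking = simulate_w(10.0, 0.1, 0.40, 50)    # gamma < 0.5: w decays
```

With $\gamma = 0.99$ and $\alpha = 0.1$ each update multiplies $w$ by $1 + 0.1 \cdot 0.98 = 1.098$, while with $\gamma = 0.4$ the multiplier is $0.98$ and $w$ shrinks toward zero.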

Key to this example is that the one transition occurs repeatedly without $w$ being updated on other
transitions. This is possible under off-policy training because the behavior policy might select actions
on those other transitions which the target policy never would. For these transitions, $\rho_t$ would be zero
and no update would be made. Under on-policy training, however, $\rho_t$ is always one. Each time there
is a transition from the $w$ state to the $2w$ state, increasing $w$, there would also have to be a transition
out of the $2w$ state. That transition would reduce $w$, unless it were to a state whose value was higher
(because $\gamma < 1$) than $2w$, and then that state would have to be followed by a state of even higher value,
or else again $w$ would be reduced. Each state can support the one before only by creating a higher
expectation. Eventually the piper must be paid. In the on-policy case the promise of future reward
must be kept and the system is kept in check. But in the off-policy case, a promise can be made and
then, after taking an action that the target policy never would, forgotten and forgiven.

This simple example communicates much of the reason why off-policy training can lead to divergence, 
but it is not completely convincing because it is not complete—it is just a fragment of a complete MDP. 
Can there really be a complete system with instability? A simple complete example of divergence is 
Baird’s counterexample. Consider the episodic seven-state, two-action MDP shown in Figure 11.1. The 
dashed action takes the system to one of the six upper states with equal probability, whereas the solid 
action takes the system to the seventh state. The behavior policy b selects the dashed and solid actions
with probabilities 6/7 and 1/7, so that the next-state distribution under it is uniform (the same for all
nonterminal states), which is also the starting distribution for each episode. The target policy $\pi$ always
takes the solid action, and so the on-policy distribution (for $\pi$) is concentrated in the seventh state.
The reward is zero on all transitions. The discount rate is $\gamma = 0.99$.

Consider estimating the state-value under the linear parameterization indicated by the expression 
shown in each state circle. For example, the estimated value of the leftmost state is $2w_1 + w_8$, where
the subscript corresponds to the component of the overall weight vector $\mathbf{w} \in \mathbb{R}^8$; this corresponds to a
feature vector for the first state being $\mathbf{x}(1) = (2, 0, 0, 0, 0, 0, 0, 1)^\top$. The reward is zero on all transitions,
so the true value function is $v_\pi(s) = 0$, for all s, which can be exactly approximated if $\mathbf{w} = \mathbf{0}$. In
fact, there are many solutions, as there are more components to the weight vector (8) than there are
nonterminal states (7). Moreover, the set of feature vectors, $\{\mathbf{x}(s) : s \in \mathcal{S}\}$, is a linearly independent
set. In all these ways this task seems a favorable case for linear function approximation. 
[Figure 11.1 diagram labels: $\pi(\text{solid} \mid \cdot) = 1$; $b(\text{dashed} \mid \cdot) = 6/7$; $b(\text{solid} \mid \cdot) = 1/7$; $\gamma = 0.99$]

Figure 11.1: Baird’s counterexample. The approximate state-value function for this Markov process is of the 
form shown by the linear expressions inside each state. The solid action usually results in the seventh state, 
and the dashed action usually results in one of the other six states, each with equal probability. The reward is 
always zero. 


If we apply semi-gradient TD(0) to this problem (11.2), then the weights diverge to infinity, as shown
in Figure 11.2 (left). The instability occurs for any positive step size, no matter how small. In fact,
it even occurs if an expected update is done as in dynamic programming (DP) instead of a sample
(learning) update, as shown in Figure 11.2 (right). That is, suppose the weight vector, $\mathbf{w}_k$, is updated
for all states at the same time in a semi-gradient way, using the DP (expectation-based) target:

$$\mathbf{w}_{k+1} = \mathbf{w}_k + \frac{\alpha}{|\mathcal{S}|} \sum_{s} \Big( \mathbb{E}\big[R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_k) \mid S_t = s\big] - \hat{v}(s, \mathbf{w}_k) \Big) \nabla \hat{v}(s, \mathbf{w}_k). \tag{11.9}$$

In this case, there is no randomness and no asynchrony, just as in a classical DP update. The method is 
conventional except in its use of semi-gradient function approximation. Yet still the system is unstable. 
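
The synchronous expected update (11.9) on Baird's counterexample can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code behind Figure 11.2: it assumes that under the target policy every state moves to the seventh state with reward zero (so any termination is folded into the discount rate), it numbers states and weights from 0, and the names `features`, `dp_sweep` and `run` are hypothetical.

```python
GAMMA = 0.99
NUM_STATES = 7


def features(state):
    """Feature vector x(s) for Baird's counterexample (states numbered 0..6)."""
    x = [0.0] * 8
    if state < 6:          # upper states: estimated value 2*w_i + w_8
        x[state] = 2.0
        x[7] = 1.0
    else:                  # seventh state: estimated value w_7 + 2*w_8
        x[6] = 1.0
        x[7] = 2.0
    return x


def value(state, w):
    return sum(xi * wi for xi, wi in zip(features(state), w))


def dp_sweep(w, alpha):
    """One synchronous expected update (11.9) over all seven states."""
    # Assumption: under the target policy every state leads to the
    # seventh state with reward 0, so all states share one target.
    target = GAMMA * value(6, w)
    step = [0.0] * 8
    for s in range(NUM_STATES):
        error = target - value(s, w)
        for i, xi in enumerate(features(s)):
            step[i] += error * xi   # gradient of a linear v is x(s)
    return [wi + alpha / NUM_STATES * di for wi, di in zip(w, step)]


def run(num_sweeps, alpha=0.01):
    """Start from the initial weights of Figure 11.2 and apply num_sweeps updates."""
    w = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 10.0, 1.0]
    for _ in range(num_sweeps):
        w = dp_sweep(w, alpha)
    return w
```

Calling `run` with increasing sweep counts shows the shared weight (the last component) growing steadily, even though there is no sampling noise at all in this update.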



Figure 11.2: Demonstration of instability on Baird’s counterexample. Shown is the evolution of the components
of the parameter vector $\mathbf{w}$ of the two semi-gradient algorithms. The step size was $\alpha = 0.01$, and the initial
weights were $\mathbf{w} = (1, 1, 1, 1, 1, 1, 10, 1)^\top$.
If we alter just the distribution of DP updates in Baird’s counterexample, from the uniform distribution
to the on-policy distribution (which generally requires asynchronous updating), then convergence
is guaranteed to a solution with error bounded by (9.14). This example is striking because the TD
and DP methods used are arguably the simplest and best-understood bootstrapping methods, and the
linear, semi-gradient method used is arguably the simplest and best-understood kind of function
approximation. The example shows that even the simplest combination of bootstrapping and function
approximation can be unstable if the updates are not done according to the on-policy distribution.

There are also counterexamples similar to Baird’s showing divergence for Q-learning. This is cause 
for concern because otherwise Q-learning has the best convergence guarantees of all control methods. 
Considerable effort has gone into trying to find a remedy to this problem or to obtain some weaker, but 
still workable, guarantee. For example, it may be possible to guarantee convergence of Q-learning as
long as the behavior policy is sufficiently close to the target policy, such as when it is the $\varepsilon$-greedy
policy. To the best of our knowledge, Q-learning has never been found to diverge in this case, but there
has been no theoretical analysis. In the rest of this section we present several other ideas that have 
been explored. 

Suppose that instead of taking just a step toward the expected one-step return on each iteration, as 
in Baird’s counterexample, we actually change the value function all the way to the best, least-squares 
approximation. Would this solve the instability problem? Of course it would if the feature vectors, 
$\{\mathbf{x}(s) : s \in \mathcal{S}\}$, formed a linearly independent set, as they do in Baird’s counterexample, because then
exact approximation is possible on each iteration and the method reduces to standard tabular DP. But 
of course the point here is to consider the case when an exact solution is not possible. In this case 
stability is not guaranteed even when forming the best approximation at each iteration, as shown by 
the counterexample in the box. 


Tsitsiklis and Van Roy’s Counterexample to DP policy evaluation with least-squares linear function
approximation


[Diagram: the w state leads to the 2w state, which returns to itself with probability $1 - \varepsilon$ and otherwise terminates.]

The simplest full counterexample to the least-squares idea is the w-to-2w example (from earlier in this
section) extended with a terminal state, as shown in the diagram. As before, the estimated value of the
first state is w, and the estimated value of the second state is 2w. The reward is zero on all transitions,
so the true values are zero at both states, which is exactly representable with w = 0. If we set $w_{k+1}$ at
each step so as to minimize the VE between the estimated value and the expected one-step return, then
we have

$$\begin{aligned}
w_{k+1} &= \arg\min_{w \in \mathbb{R}} \sum_{s \in \mathcal{S}} \Big( \hat{v}(s, w) - \mathbb{E}\big[R_{t+1} + \gamma \hat{v}(S_{t+1}, w_k) \mid S_t = s\big] \Big)^2 \\
&= \arg\min_{w \in \mathbb{R}} \; (w - \gamma 2 w_k)^2 + \big(2w - (1 - \varepsilon)\gamma 2 w_k\big)^2 \\
&= \frac{6 - 4\varepsilon}{5} \gamma w_k. \qquad (11.10)
\end{aligned}$$

The sequence $\{w_k\}$ diverges when $\gamma > \frac{5}{6 - 4\varepsilon}$ and $w_0 \neq 0$.


Another way to try to prevent instability is to use special methods for function approximation. In
particular, stability is guaranteed for function approximation methods that do not extrapolate from
the observed targets. These methods, called averagers, include nearest neighbor methods and locally
weighted regression, but not popular methods such as tile coding and artificial neural networks.
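
Returning to the Tsitsiklis and Van Roy counterexample, the least-squares iteration (11.10) can be reproduced directly. The sketch below is illustrative only: the function names are hypothetical, and the values of $\gamma$ and $\varepsilon$ in the usage note are assumptions chosen to fall on either side of the divergence condition.

```python
def least_squares_step(w_k, gamma, eps):
    """One iteration of (11.10): fit w to the expected one-step returns."""
    features = [1.0, 2.0]   # the estimated values are w and 2w
    # Expected one-step returns (all rewards are zero):
    #   first state  -> second state:                gamma * 2 * w_k
    #   second state -> itself with prob. (1 - eps): (1 - eps) * gamma * 2 * w_k
    targets = [gamma * 2.0 * w_k, (1.0 - eps) * gamma * 2.0 * w_k]
    # Scalar least squares: w = sum(x * target) / sum(x * x)
    numerator = sum(x * t for x, t in zip(features, targets))
    denominator = sum(x * x for x in features)
    return numerator / denominator


def iterate(w0, gamma, eps, num_iterations):
    """Apply the least-squares step repeatedly, starting from w0."""
    w = w0
    for _ in range(num_iterations):
        w = least_squares_step(w, gamma, eps)
    return w
```

For example, `iterate(1.0, gamma=0.99, eps=0.01, num_iterations=100)` grows very large because $0.99 > 5/(6 - 0.04)$, whereas `iterate(1.0, gamma=0.5, eps=0.01, num_iterations=100)` shrinks toward zero.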
11.3 The Deadly Triad

Our discussion so far can be summarized by saying that the danger of instability and divergence arises 
whenever we combine all of the following three elements, making up what we call the deadly triad: 

Function approximation A powerful, scalable way of generalizing from a state space much larger 
than the memory and computational resources (e.g., linear function approximation or artificial 
neural networks). 

Bootstrapping Update targets that include existing estimates (as in dynamic programming or TD 
methods) rather than relying exclusively on actual rewards and complete returns (as in MC 
methods). 

Off-policy training Training on a distribution of transitions other than that produced by the target
policy. Sweeping through the state space and updating all states uniformly, as in dynamic
programming, does not respect the target policy and is an example of off-policy training.

In particular, note that the danger is not due to control, or to generalized policy iteration. Those cases 
are more complex to analyze, but the instability arises in the simpler prediction case whenever it includes 
all three elements of the deadly triad. The danger is also not due to learning or to uncertainties about 
the environment, because it occurs just as strongly in planning methods, such as dynamic programming, 
in which the environment is completely known. 

If any two elements of the deadly triad are present, but not all three, then instability can be avoided. 
It is natural, then, to go through the three and see if there is any one that can be given up. 

Of the three, function approximation most clearly cannot be given up. We need methods that scale to 
large problems and to great expressive power. We need at least linear function approximation with many 
features and parameters. State aggregation or nonparametric methods whose complexity grows with 
data are too weak or too expensive. Least-squares methods such as LSTD are of quadratic complexity 
and are therefore too expensive for large problems. 

Doing without bootstrapping is possible, at the cost of computational and data efficiency. Perhaps 
most important are the losses in computational efficiency. Monte Carlo (non-bootstrapping) methods 
require memory to save everything that happens between making each prediction and obtaining the 
final return, and all their computation is done once the final return is obtained. The cost of these 
computational issues is not apparent on serial von Neumann computers, but would be on specialized 
hardware. With bootstrapping and eligibility traces (Chapter 12), data can be dealt with when and 
where it is generated, then need never be used again. The savings in communication and memory made 
possible by bootstrapping are great. 

The losses in data efficiency by giving up bootstrapping are also significant. We have seen this
repeatedly, such as in Chapters 7 (Figure 7.2) and 9 (Figure 9.2), where some degree of bootstrapping
performed much better than Monte Carlo methods on the random-walk prediction task, and in
Chapter 10 where the same was seen on the Mountain-Car control task (Figure 10.4). Many other problems
show much faster learning with bootstrapping (e.g., see Figure 12.14). Bootstrapping often results in
faster learning because it allows learning to take advantage of the state property, the ability to
recognize a state upon returning to it. On the other hand, bootstrapping can impair learning on problems
where the state representation is poor and causes poor generalization (e.g., this seems to be the case
on Tetris, see Şimşek, Algorta, and Kothiyal, 2016). A poor state representation can also result in
bias; this is the reason for the poorer bound on the asymptotic approximation quality of bootstrapping
methods (Equation 9.14). On balance, the ability to bootstrap has to be considered extremely valuable.
One may sometimes choose not to use it by selecting long multistep updates (or a large bootstrapping
parameter, $\lambda \approx 1$; see Chapter 12) but often bootstrapping greatly increases efficiency. It is an ability
that we would very much like to keep in our toolkit.
Finally, there is off-policy learning; can we give that up? On-policy methods are often adequate.
For model-free reinforcement learning, one can simply use Sarsa rather than Q-learning. Off-policy 
methods free behavior from the target policy. This could be considered an appealing convenience but 
not a necessity. However, off-policy learning is essential to other anticipated use cases, cases that we 
have not yet mentioned in this book but may be important to the larger goal of creating a powerful 
intelligent agent. 

In these use cases, the agent learns not just a single value function and single policy, but large 
numbers of them in parallel. There is extensive psychological evidence that people and animals learn 
to predict many different sensory events, not just rewards. We can be surprised by unusual events, and 
correct our predictions about them, even if they are of neutral valence (neither good nor bad). This 
kind of prediction presumably underlies predictive models of the world such as are used in planning. 
We predict what we will see after eye movements, how long it will take to walk home, the probability of 
making a jump shot in basketball, and the satisfaction we will get from taking on a new project. In all 
these cases, the events we would like to predict depend on our acting in a certain way. To learn them 
all, in parallel, requires learning from the one stream of experience. There are many target policies, and 
thus the one behavior policy cannot equal all of them. Yet parallel learning is conceptually possible 
because the behavior policy may overlap in part with many of the target policies. To take full advantage 
of this requires off-policy learning. 


11.4 Linear Value-function Geometry 

To better understand the stability challenge of off-policy learning, it is helpful to think about value 
function approximation more abstractly and independently of how learning is done. We can imagine 
the space of all possible state-value functions, that is, all functions from states to real numbers $v : \mathcal{S} \to \mathbb{R}$.
Most of these value functions do not correspond to any policy. More important for our purposes is that 
most are not representable by the function approximator, which by design has far fewer parameters 
than there are states. 

Given an enumeration of the state space $\mathcal{S} = \{s_1, s_2, \ldots, s_{|\mathcal{S}|}\}$, any value function v corresponds to a
vector listing the value of each state in order $[v(s_1), v(s_2), \ldots, v(s_{|\mathcal{S}|})]^\top$. This vector representation of a
value function has as many components as there are states. In most cases where we want to use function
approximation, this would be far too many components to represent the vector explicitly. Nevertheless,
the idea of this vector is conceptually useful. In the following, we treat a value function and its vector
representation interchangeably.

To develop intuitions, consider the case with three states $\mathcal{S} = \{s_1, s_2, s_3\}$ and two parameters $\mathbf{w} = (w_1, w_2)^\top$.
We can then view all value functions/vectors as points in a three-dimensional space. The
parameters provide an alternative coordinate system over a two-dimensional subspace. Any weight
vector $\mathbf{w} = (w_1, w_2)^\top$ is a point in the two-dimensional subspace and thus also a complete value function
$v_{\mathbf{w}}$ that assigns values to all three states. With general function approximation the relationship between
the full space and the subspace of representable functions could be complex, but in the case of linear 
value-function approximation the subspace is a simple plane, as suggested by Figure 11.3. 

Now consider a single fixed policy $\pi$. We assume that its true value function, $v_\pi$, is too complex to
be represented exactly as an approximation. Thus $v_\pi$ is not in the subspace; in the figure it is depicted
as being above the planar subspace of representable functions.

Figure 11.3: The geometry of linear value-function approximation. Shown is the three-dimensional space of all
value functions over three states, while shown as a plane is the subspace of all value functions representable by
a linear function approximator with parameter $\mathbf{w} = (w_1, w_2)^\top$. The true value function $v_\pi$ is in the larger space
and can be projected down (into the subspace, using a projection operator $\Pi$) to its best approximation in the
value error (VE) sense. The best approximators in the Bellman error (BE), projected Bellman error (PBE),
and temporal difference error (TDE) senses are all potentially different and are shown in the lower right. (VE,
BE, and PBE are all treated as the corresponding vectors in this figure.) The Bellman operator takes a value
function in the plane to one outside, which can then be projected back. If you iteratively applied the Bellman
operator outside the space (shown in gray above) you would reach the true value function, as in conventional
dynamic programming. If instead you kept projecting back into the subspace at each step, as in the lower step
shown in gray, then the fixed point would be the point of vector-zero PBE.

If $v_\pi$ cannot be represented exactly, what representable value function is closest to it? This turns out
to be a subtle question with multiple answers. To begin, we need a measure of the distance between two
value functions. Given two value functions $v_1$ and $v_2$, we can talk about the vector difference between
them, $v = v_1 - v_2$. If v is small, then the two value functions are close to each other. But how are
we to measure the size of this difference vector? The conventional Euclidean norm is not appropriate
because, as discussed in Section 9.2, some states are more important than others because they occur
more frequently or because we are more interested in them (Section 9.10). As in Section 9.2, let us use
the weighting $\mu : \mathcal{S} \to \mathbb{R}$ to specify the degree to which we care about different states being accurately
valued (often taken to be the on-policy distribution). We can then define the distance between value
functions using the norm

$$\|v\|_\mu^2 = \sum_{s \in \mathcal{S}} \mu(s)\, v(s)^2. \tag{11.11}$$


Note that the VE from Section 9.2 can be written simply using this norm as $\mathrm{VE}(\mathbf{w}) = \|v_{\mathbf{w}} - v_\pi\|_\mu^2$. For
any value function v, the operation of finding its closest value function in the subspace of representable
value functions is a projection operation. We define a projection operator $\Pi$ that takes an arbitrary
value function to the representable function that is closest in our norm:

$$\Pi v = v_{\mathbf{w}} \quad \text{where} \quad \mathbf{w} = \arg\min_{\mathbf{w} \in \mathbb{R}^d} \|v - v_{\mathbf{w}}\|_\mu^2. \tag{11.12}$$


The representable value function that is closest to the true value function $v_\pi$ is thus its projection,
as suggested in Figure 11.3. This is the solution asymptotically found by Monte Carlo methods, albeit 
often very slowly. The projection operation is discussed more fully in the box on the next page. 
The projection matrix


For a linear function approximator, the projection operation is linear, which implies that it can
be represented as an $|\mathcal{S}| \times |\mathcal{S}|$ matrix:

$$\Pi = \mathbf{X}\left(\mathbf{X}^\top \mathbf{D} \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{D}, \tag{11.13}$$

where, as in Section 9.4, $\mathbf{D}$ denotes the $|\mathcal{S}| \times |\mathcal{S}|$ diagonal matrix with the $\mu(s)$ on the diagonal,
and $\mathbf{X}$ denotes the $|\mathcal{S}| \times d$ matrix whose rows are the feature vectors $\mathbf{x}(s)^\top$, one for each state
s. If the inverse in (11.13) does not exist, then the pseudoinverse is substituted. Using these matrices,
the norm of a vector can be written

$$\|v\|_\mu^2 = v^\top \mathbf{D} v, \tag{11.14}$$

and the approximate linear value function can be written

$$v_{\mathbf{w}} = \mathbf{X}\mathbf{w}. \tag{11.15}$$
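
The projection (11.13) can be computed without forming the matrix explicitly, by solving the weighted normal equations $(\mathbf{X}^\top \mathbf{D} \mathbf{X})\mathbf{w} = \mathbf{X}^\top \mathbf{D} v$ and returning $\mathbf{X}\mathbf{w}$. The sketch below is a minimal illustration in plain Python. It assumes the inverse exists (the pseudoinverse case is not handled), and the helper names and the three-state example in the usage note are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def project(v, X, mu):
    """Return the mu-weighted projection of v onto the span of the features.

    X is a list of feature vectors (one per state), mu the state weighting.
    Assumes X^T D X is invertible.
    """
    d = len(X[0])
    # Weighted normal equations: (X^T D X) w = X^T D v
    A = [[sum(m * x[i] * x[j] for m, x in zip(mu, X)) for j in range(d)]
         for i in range(d)]
    b = [sum(m * x[i] * vs for m, x, vs in zip(mu, X, v)) for i in range(d)]
    w = solve(A, b)
    return [sum(xi * wi for xi, wi in zip(x, w)) for x in X]
```

As a usage example, with `X = [[1, 0], [0, 1], [1, 1]]`, `mu = [0.5, 0.25, 0.25]` and `v = [1, 2, 0]`, the call `project(v, X, mu)` returns the closest representable value function in the weighted norm. Projecting the result a second time leaves it unchanged, as expected of a projection.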

TD methods find different solutions. To understand their rationale, recall that the Bellman equation
for value function $v_\pi$ is

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma v_\pi(s')\big], \quad \text{for all } s \in \mathcal{S}. \tag{11.16}$$

$v_\pi$ is the only value function that solves this equation exactly. If an approximate value function $v_{\mathbf{w}}$
were substituted for $v_\pi$, the difference between the right and left sides of the modified equation could
be used as a measure of how far off $v_{\mathbf{w}}$ is from $v_\pi$. We call this the Bellman error at state s:

$$\begin{aligned}
\bar{\delta}_{\mathbf{w}}(s) &= \Big( \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma v_{\mathbf{w}}(s')\big] \Big) - v_{\mathbf{w}}(s) \qquad (11.17) \\
&= \mathbb{E}\big[R_{t+1} + \gamma v_{\mathbf{w}}(S_{t+1}) - v_{\mathbf{w}}(S_t) \mid S_t = s, A_t \sim \pi\big], \qquad (11.18)
\end{aligned}$$

which shows clearly the relationship of the Bellman error to the TD error (11.3). The Bellman error is
the expectation of the TD error.

The vector of all the Bellman errors, at all states, $\bar{\delta}_{\mathbf{w}} \in \mathbb{R}^{|\mathcal{S}|}$, is called the Bellman error vector
(shown as BE in Figure 11.3). The overall size of this vector, in the norm, is an overall measure of the
error in the value function, called the Mean Squared Bellman Error:

$$\mathrm{BE}(\mathbf{w}) = \left\|\bar{\delta}_{\mathbf{w}}\right\|_\mu^2. \tag{11.19}$$


It is not possible in general to reduce the BE to zero (at which point $v_{\mathbf{w}} = v_\pi$), but for linear function
approximation there is a unique value of w for which the BE is minimized. This point in the
representable-function subspace (labeled min BE in Figure 11.3) is different in general from that which
minimizes the VE (shown as $\Pi v_\pi$). Methods that seek to minimize the BE are discussed in the next
two sections.

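The Bellman error vector is shown in Figure 11.3 as the result of applying the Bellman operator
$B_\pi : \mathbb{R}^{|\mathcal{S}|} \to \mathbb{R}^{|\mathcal{S}|}$ to the approximate value function. The Bellman operator is defined by

$$(B_\pi v)(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma v(s')\big], \tag{11.20}$$

for all $s \in \mathcal{S}$ and $v : \mathcal{S} \to \mathbb{R}$. The Bellman error vector for $v_{\mathbf{w}}$ can be written $\bar{\delta}_{\mathbf{w}} = B_\pi v_{\mathbf{w}} - v_{\mathbf{w}}$.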
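
The Bellman operator, the Bellman error vector and the Mean Squared Bellman Error can be sketched for a small Markov reward process. The code below is a minimal illustration under a simplifying assumption: the policy and dynamics have already been averaged into a state-to-state transition matrix `P` and an expected-reward vector `r`, so the sums over actions and rewards in (11.20) do not appear explicitly. The two-state example in the usage note is hypothetical.

```python
def bellman_operator(v, P, r, gamma):
    """Apply B_pi to v: (B_pi v)(s) = r(s) + gamma * sum_s' P[s][s'] * v(s')."""
    return [r[s] + gamma * sum(p * vn for p, vn in zip(P[s], v))
            for s in range(len(v))]


def bellman_error_vector(v, P, r, gamma):
    """Bellman error vector: B_pi v - v."""
    return [bv - vs for bv, vs in zip(bellman_operator(v, P, r, gamma), v)]


def mean_squared_bellman_error(v, P, r, gamma, mu):
    """Squared mu-weighted norm of the Bellman error vector, as in (11.19)."""
    delta = bellman_error_vector(v, P, r, gamma)
    return sum(m * d * d for m, d in zip(mu, delta))
```

As a usage example, take a two-state chain in which the first state leads to the second with reward 0 and the second terminates with reward 1: `P = [[0, 1], [0, 0]]`, `r = [0, 1]`, `gamma = 0.9`. The true values are `[0.9, 1.0]`, for which the Bellman error vector is zero, while the all-zero value function has a Bellman error of 1 at the second state.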
If the Bellman operator is applied to a value function in the representable subspace, then, in general,
it will produce a new value function that is outside the subspace, as suggested in the figure. In dynamic
programming (without function approximation), this operator is applied repeatedly to the points outside
the representable space, as suggested by the gray arrows in the top of Figure 11.3. Eventually that
process converges to the true value function $v_\pi$, the only fixed point for the Bellman operator, the only
value function for which

$$v_\pi = B_\pi v_\pi, \tag{11.21}$$

which is just another way of writing the Bellman equation for $\pi$ (11.16).

With function approximation, however, the intermediate value functions lying outside the subspace 
cannot be represented. The gray arrows in the upper part of Figure 11.3 cannot be followed because 
after the first update (dark line) the value function must be projected back into something representable. 
The next iteration then begins within the subspace; the value function is again taken outside of the 
subspace by the Bellman operator and then mapped back by the projection operator, as suggested by 
the lower gray arrow and line. Following these arrows is a DP-like process with approximation. 

In this case we are interested in the projection of the Bellman error vector back into the representable
space. This is the projected Bellman error vector $\Pi \bar{\delta}_{\mathbf{w}}$, shown in Figure 11.3 as PBE. The size of this
vector, in the norm, is another measure of error in the approximate value function. For any approximate
value function v, we define the Mean Square Projected Bellman Error, denoted PBE, as

$$\mathrm{PBE}(\mathbf{w}) = \left\|\Pi \bar{\delta}_{\mathbf{w}}\right\|_\mu^2. \tag{11.22}$$

With linear function approximation there always exists an approximate value function (within the 
subspace) with zero PBE; this is the TD fixed point, $\mathbf{w}_{\mathrm{TD}}$, introduced in Section 9.4. As we have seen,
this point is not always stable under semi-gradient TD methods and off-policy training. As shown in 
the figure, this value function is generally different from those minimizing VE or BE. Methods that are 
guaranteed to converge to it are discussed in Sections 11.7 and 11.8. 
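
To make the PBE and the TD fixed point concrete, consider the simplest possible case of a single feature, where the projection of a vector u reduces to $\Pi u = x \, (\sum_s \mu(s) x(s) u(s)) / (\sum_s \mu(s) x(s)^2)$. The following sketch is an illustration under stated assumptions, not an example from the text: the two-state chain, the shared feature value of 1 (state aggregation), and all names are hypothetical.

```python
# Hypothetical two-state chain: state 0 -> state 1 (reward 0),
# state 1 -> terminal (reward 1). Both states share one feature.
GAMMA = 0.9
P = [[0.0, 1.0], [0.0, 0.0]]   # policy-averaged transition matrix
R = [0.0, 1.0]                 # expected one-step rewards
X = [1.0, 1.0]                 # single feature per state
MU = [0.5, 0.5]                # state weighting


def bellman_error(w):
    """Bellman error vector for the linear value function v_w(s) = w * x(s)."""
    v = [x * w for x in X]
    return [R[s] + GAMMA * sum(p * vn for p, vn in zip(P[s], v)) - v[s]
            for s in range(len(X))]


def pbe(w):
    """Squared weighted norm of the projected Bellman error (one feature)."""
    delta = bellman_error(w)
    corr = sum(m * x * d for m, x, d in zip(MU, X, delta))
    return corr * corr / sum(m * x * x for m, x in zip(MU, X))


def td_fixed_point():
    """Weight with zero PBE: solves sum_s mu(s) x(s) delta_w(s) = 0."""
    # delta_w(s) is affine in w: R[s] - w * (x(s) - gamma * (P x)(s))
    px = [sum(p * x for p, x in zip(P[s], X)) for s in range(len(X))]
    a = sum(m * x * (x - GAMMA * pxs) for m, x, pxs in zip(MU, X, px))
    b = sum(m * x * r for m, x, r in zip(MU, X, R))
    return b / a
```

In this example the TD fixed point is $w = 1/1.1 \approx 0.909$, whereas the weight minimizing the VE is 0.95 (the mean of the true values 0.9 and 1.0), which illustrates that the zero-PBE solution generally differs from the VE minimizer.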


11.5 Stochastic Gradient Descent in the Bellman Error 

Armed with a better understanding of value function approximation and its various objectives, we 
return now to the challenge of stability in off-policy learning. We would like to apply the approach of 
stochastic gradient descent (SGD, Section 9.3), in which updates are made that in expectation are equal 
to the negative gradient of an objective function. These methods always go downhill (in expectation) 
in the objective and because of this are typically stable with excellent convergence properties. Among 
the algorithms investigated so far in this book, only the Monte Carlo methods are true SGD methods. 
These methods converge robustly under both on-policy and off-policy training as well as for general 
non-linear (differentiable) function approximators, though they are often slower than semi-gradient 
methods with bootstrapping, which are not SGD methods. Semi-gradient methods may diverge under 
off-policy training, as we have seen earlier in this chapter, and under contrived cases of non-linear 
function approximation (Tsitsiklis and Van Roy, 1997). With a true SGD method such divergence 
would not be possible. 

The appeal of SGD is so strong that great effort has gone into finding a practical way of harnessing 
it for reinforcement learning. The starting place of all such efforts is the choice of an error or objective 
function to optimize. In this and the next section we explore the origins and limits of the most 
popular proposed objective function, that based on the Bellman error introduced in the previous section. 
Although this has been a popular and influential approach, the conclusion that we reach here is that 
it is a misstep and yields no good learning algorithms. On the other hand, this approach fails in an 
interesting way that offers insight into what might constitute a good approach. 
To begin, let us consider not the Bellman error, but something more immediate and naive. Temporal
difference learning is driven by the TD error. Why not take the minimization of the expected square of
the TD error as the objective? In the general function-approximation case, the one-step TD error with
discounting is

$$\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t).$$

A possible objective function then is what one might call the Mean Squared TD Error:

$$\begin{aligned}
\mathrm{TDE}(\mathbf{w}) &= \sum_{s \in \mathcal{S}} \mu(s)\, \mathbb{E}\big[\delta_t^2 \mid S_t = s, A_t \sim \pi\big] \\
&= \sum_{s \in \mathcal{S}} \mu(s)\, \mathbb{E}\big[\rho_t \delta_t^2 \mid S_t = s, A_t \sim b\big] \\
&= \mathbb{E}_b\big[\rho_t \delta_t^2\big]. \qquad \text{(if } \mu \text{ is the distribution encountered under } b\text{)}
\end{aligned}$$

The last equation is of the form needed for SGD; it gives the objective as an expectation that can be
sampled from experience (remember the experience is due to the behavior policy b). Thus, following the
standard SGD approach, one can derive the per-step update based on a sample of this expected value:

$$\begin{aligned}
\mathbf{w}_{t+1} &= \mathbf{w}_t - \tfrac{1}{2} \alpha \nabla\big(\rho_t \delta_t^2\big) \\
&= \mathbf{w}_t - \alpha \rho_t \delta_t \nabla \delta_t \\
&= \mathbf{w}_t + \alpha \rho_t \delta_t \big( \nabla \hat{v}(S_t, \mathbf{w}_t) - \gamma \nabla \hat{v}(S_{t+1}, \mathbf{w}_t) \big), \qquad (11.23)
\end{aligned}$$

which you will recognize as the same as the semi-gradient TD algorithm (11.2) except for the additional
final term. This term completes the gradient and makes this a true SGD algorithm with excellent
convergence guarantees. Let us call this algorithm the naive residual-gradient algorithm (after Baird,
1993).
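
For linear function approximation, where $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$ and the gradient is simply $\mathbf{x}(s)$, the update (11.23) takes only a few lines. The sketch below is illustrative; the function name and argument names are hypothetical, and a terminal next state is represented by an all-zero feature vector by assumption.

```python
def naive_residual_gradient_update(w, x, x_next, reward, rho, alpha, gamma):
    """One step of (11.23) for a linear value function v(s, w) = w . x(s).

    x and x_next are the feature vectors of S_t and S_{t+1}; a terminal
    next state is assumed to be encoded as an all-zero feature vector.
    """
    v = sum(wi * xi for wi, xi in zip(w, x))
    v_next = sum(wi * xi for wi, xi in zip(w, x_next))
    delta = reward + gamma * v_next - v
    # Semi-gradient TD would use only x here; the extra -gamma * x_next
    # term is what completes the gradient of the squared TD error.
    return [wi + alpha * rho * delta * (xi - gamma * xni)
            for wi, xi, xni in zip(w, x, x_next)]
```

Note that, unlike semi-gradient TD, a single transition changes the weights associated with the next state as well as those of the current state, and in the opposite direction.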

Although the naive residual-gradient algorithm converges robustly, it does not always converge to 
a desirable place, as the A-split example in the box shows. In this example a tabular representation 
is used, so the true state values can be exactly represented, yet the naive residual-gradient algorithm 
finds different values, and these values have lower TDE than do the true values. Minimizing the TDE 
is naive; by penalizing all TD errors it achieves something more like temporal smoothing than accurate 
prediction. 


A-split example, showing the naivete of the naive residual gradient algorithm 


Consider the following three-state episodic MRP: 



Episodes begin in state A and then ‘split’ stochastically, half the time going to B and then 
invariably going on to terminate with a reward of 1, and half the time going to state C and then 
invariably terminating with a reward of zero. Reward for the first transition, out of A, is always 
zero whichever way the episode goes. As this is an episodic problem, we can take $\gamma$ to be 1. We
also assume on-policy training, so that $\rho_t$ is always 1, and tabular function approximation, so
that the learning algorithms are free to give arbitrary, independent values to all three states. So 
it should be an easy problem. 
What should the values be? From A, half the time the return is 1, and half the time the
return is 0; A should have value 1/2. From B the return is always 1, so its value should be 1, and
similarly from C the return is always 0, so its value should be 0. These are the true values and,
as this is a tabular problem, all the methods presented previously converge to them exactly.

However, the naive residual-gradient algorithm finds different values for B and C. It converges
with B having a value of 3/4 and C having a value of 1/4 (A converges correctly to 1/2). These are in
fact the values that minimize the TDE.

Let us compute the TDE for these values. The first transition of each episode is either up
from A’s 1/2 to B’s 3/4, a change of 1/4, or down from A’s 1/2 to C’s 1/4, a change of −1/4. Because
the reward is zero on these transitions, and $\gamma = 1$, these changes are the TD errors, and thus
the squared TD error is always 1/16 on the first transition. The second transition is similar; it is
either up from B’s 3/4 to a reward of 1 (and a terminal state of value 0), or down from C’s 1/4 to
a reward of 0 (again with a terminal state of value 0). Thus, the TD error is always ±1/4, for a
squared error of 1/16 on the second step. Thus, for this set of values, the TDE on both steps is
1/16.

Now let’s compute the TDE for the true values (B at 1, C at 0, and A at 1/2). In this case
the first transition is either from 1/2 up to 1, at B, or from 1/2 down to 0, at C; in either case the
absolute error is 1/2 and the squared error is 1/4. The second transition has zero error because
the starting value, either 1 or 0 depending on whether the transition is from B or C, always
exactly matches the immediate reward and return. Thus the squared TD error is 1/4 on the first
transition and 0 on the second, for a mean squared error over the two transitions of 1/8. As 1/8 is bigger
than 1/16, this solution is worse according to the TDE. On this simple problem, the true values
do not have the smallest TDE.
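
The arithmetic of the A-split example can be checked with a short script. The sketch below is illustrative and makes one simplifying assumption that is not in the text: instead of sampling the split at random, it alternates deterministically between the B and C episodes, which visits both branches equally often and keeps the run reproducible. The function names are hypothetical.

```python
def tde(v_a, v_b, v_c):
    """Mean squared TD error over the four equally likely transitions (gamma = 1)."""
    errors = [
        0.0 + v_b - v_a,   # A -> B, reward 0
        0.0 + v_c - v_a,   # A -> C, reward 0
        1.0 + 0.0 - v_b,   # B -> terminal, reward 1
        0.0 + 0.0 - v_c,   # C -> terminal, reward 0
    ]
    return sum(e * e for e in errors) / len(errors)


def run_naive_residual_gradient(num_episodes=20000, alpha=0.01):
    """Tabular naive residual-gradient updates on the A-split example.

    Assumption: episodes alternate between the B and C branches rather
    than being sampled, so that both occur exactly half the time.
    """
    v = {"A": 0.0, "B": 0.0, "C": 0.0, "T": 0.0}   # T is the terminal state
    for episode in range(num_episodes):
        middle, final_reward = ("B", 1.0) if episode % 2 == 0 else ("C", 0.0)
        for s, r, s_next in (("A", 0.0, middle), (middle, final_reward, "T")):
            delta = r + v[s_next] - v[s]
            v[s] += alpha * delta            # gradient term for the current state
            if s_next != "T":
                v[s_next] -= alpha * delta   # extra term from the next state
    return v
```

Here `tde(0.5, 0.75, 0.25)` evaluates to 1/16 and `tde(0.5, 1.0, 0.0)` to 1/8, matching the calculation in the box, and the values returned by `run_naive_residual_gradient()` settle close to 1/2, 3/4 and 1/4 rather than the true values.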

A better idea would seem to be minimizing the Bellman error. If the exact values are learned, the 
Bellman error is zero everywhere. Thus, a Bellman-error-minimizing algorithm should have no trouble 
with the A-split example. We cannot expect to achieve zero Bellman error in general, as it would involve 
finding the true value function, which we presume is outside the space of representable value functions. 
But getting close to this ideal is a natural-seeming goal. As we have seen, the Bellman error is also 
closely related to the TD error. The Bellman error for a state is the expected TD error in that state. 
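So let’s repeat the derivation above with the expected TD error (all expectations here are implicitly
conditional on $S_t$):

$$\begin{aligned}
\mathbf{w}_{t+1} &= \mathbf{w}_t - \tfrac{1}{2} \alpha \nabla\big(\mathbb{E}_\pi[\delta_t]^2\big) \\
&= \mathbf{w}_t - \tfrac{1}{2} \alpha \nabla\big(\mathbb{E}_b[\rho_t \delta_t]^2\big) \\
&= \mathbf{w}_t - \alpha\, \mathbb{E}_b[\rho_t \delta_t]\, \nabla \mathbb{E}_b[\rho_t \delta_t] \\
&= \mathbf{w}_t - \alpha\, \mathbb{E}_b\big[\rho_t \big(R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})\big)\big]\, \mathbb{E}_b[\rho_t \nabla \delta_t] \\
&= \mathbf{w}_t + \alpha \Big[ \mathbb{E}_b\big[\rho_t \big(R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})\big)\big] - \hat{v}(S_t, \mathbf{w}) \Big] \Big[ \nabla \hat{v}(S_t, \mathbf{w}) - \gamma\, \mathbb{E}_b\big[\rho_t \nabla \hat{v}(S_{t+1}, \mathbf{w})\big] \Big].
\end{aligned}$$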

This update and various ways of sampling it are referred to as the residual gradient algorithm. If you
simply used the sample values in all the expectations, then the equation above reduces almost exactly to
(11.23), the naive residual-gradient algorithm.¹ But this is naive, because the equation above involves
the next state, $S_{t+1}$, appearing in two expectations that are multiplied together. To get an unbiased
sample of the product, two independent samples of the next state are required, but during normal
interaction with an external environment only one is obtained. One expectation or the other can be
sampled, but not both.

¹ For state values there remains a small difference in the treatment of the importance sampling ratio $\rho_t$. In the analogous
action-value case (which is the most important case for control algorithms), the residual gradient algorithm would reduce
exactly to the naive version.
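
The need for two independent samples can be seen by enumeration. The sketch below is an illustration rather than the residual gradient algorithm itself: for simplicity it uses the product of the expected TD error with itself, $\mathbb{E}[\delta]\,\mathbb{E}[\delta]$, instead of the product of an expected TD error with an expected gradient, and it borrows state A of the A-split example with the true values. All names are hypothetical.

```python
from itertools import product

# From state A in the A-split example, using the true values:
V_A = 0.5
NEXT_VALUES = [1.0, 0.0]   # v(B) and v(C), each reached with probability 1/2


def td_error(v_next, reward=0.0, gamma=1.0):
    return reward + gamma * v_next - V_A


def squared_expected_td_error():
    """The quantity we want: (E[delta])^2."""
    expected = sum(td_error(v) for v in NEXT_VALUES) / len(NEXT_VALUES)
    return expected ** 2


def single_sample_estimate():
    """Average of delta * delta when both factors reuse the same next state."""
    return sum(td_error(v) ** 2 for v in NEXT_VALUES) / len(NEXT_VALUES)


def double_sample_estimate():
    """Average of delta' * delta'' over independent pairs of next states."""
    pairs = list(product(NEXT_VALUES, repeat=2))
    return sum(td_error(a) * td_error(b) for a, b in pairs) / len(pairs)
```

Averaged over independent pairs, the product matches the squared expectation (zero here), whereas reusing a single sample gives the expected squared TD error (1/4 here). That second quantity is what the naive algorithm minimizes.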

There are two ways to make the residual gradient algorithm work. One is in the case of deterministic 
environments. If the transition to the next state is deterministic, then the two samples will necessarily
be the same, and the naive algorithm is valid. The other way is to obtain two independent samples
of the next state, $S_{t+1}$, from $S_t$, one for the first expectation and another for the second expectation.
In real interaction with an environment, this would not seem possible, but when interacting with a 
simulated environment, it is. One simply rolls back to the previous state and obtains an alternate next 
state before proceeding forward from the first next state. In either of these cases the residual gradient 
algorithm is guaranteed to converge to a minimum of the BE under the usual conditions on the step-size 
parameter. As a true SGD method, this convergence is robust, applying to both linear and non-linear 
function approximators. In the linear case, convergence is always to the unique w that minimizes the 
BE. 
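The need for two independent samples can be checked on a small case. The sketch below is illustrative only: it takes state A of the A-split example with hypothetical tabular values (A = 1/2, B = 1, C = 0, and discount 1) and compares three quantities for the component of the update belonging to the value of B: the target product of expectations, the naive estimate that reuses one sampled next state in both factors, and the double-sampled estimate. The variable and function names are assumptions made for this example.

```python
from fractions import Fraction
from itertools import product

# Hypothetical tabular values for the A-split example (discount = 1).
v = {"A": Fraction(1, 2), "B": Fraction(1), "C": Fraction(0)}
next_states = ["B", "C"]      # from A, each occurs with probability 1/2
p = Fraction(1, 2)


def delta(s_next):
    """TD error for the transition A -> s_next, which has reward 0."""
    return v[s_next] - v["A"]


def d_delta_d_vb(s_next):
    """Partial derivative of that TD error with respect to the value of B."""
    return Fraction(1) if s_next == "B" else Fraction(0)


# Target: the product of the two expectations, E[delta] * E[d delta / d v_B].
true_product = (sum(p * delta(s) for s in next_states)
                * sum(p * d_delta_d_vb(s) for s in next_states))

# Naive: a single sampled next state is used in both factors.
naive = sum(p * delta(s) * d_delta_d_vb(s) for s in next_states)

# Double sampling: two independent next states, one per factor.
double = sum(p * p * delta(s1) * d_delta_d_vb(s2)
             for s1, s2 in product(next_states, repeat=2))

print(true_product, naive, double)
```

With these values the expected TD error from A is zero, so the target product is zero. The double-sampled estimate matches it in expectation, while the naive estimate has expectation 1/4.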

However, there remain at least three ways in which the convergence of the residual gradient method 
is unsatisfactory. The first of these is that empirically it is slow, much slower than semi-gradient
methods. Indeed, proponents of this method have proposed increasing its speed by combining it with faster
semi-gradient methods initially, then gradually switching over to residual gradient for the convergence
guarantee (Baird and Moore, 1999). The second way in which the residual-gradient algorithm is
unsatisfactory is that it still seems to converge to the wrong values. It does get the right values in all
tabular cases, such as the A-split example, as for those an exact solution to the Bellman equation is 
possible. But if we examine examples with genuine function approximation, then the residual-gradient 
algorithm, and indeed the BE objective, seem to find the wrong value functions. One of the most telling 
such examples is the variation on the A-split example shown in the box. On the A-presplit example 
the residual-gradient algorithm finds the same poor solution as its naive version. This example shows 
intuitively that minimizing the BE (which the residual-gradient algorithm surely does) may not be a 
desirable goal. 


A-presplit example, a counterexample for the BE 


Consider the following three-state episodic MRP: 

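[Diagram: A1 → B → termination, with rewards 0 then 1; A2 → C → termination, with rewards 0 then 0. A1 and A2 share the single representation A.]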

Episodes start in either A1 or A2, with equal probability. These two states look exactly the 
same to the function approximator, like a single state A whose feature representation is distinct 
from and unrelated to the feature representation of the other two states, B and C, which are 
also distinct from each other. Specifically, the parameter of the function approximator has three 
components, one giving the value of state B, one giving the value of state C, and one giving 
the value of both states A1 and A2. Other than the selection of the initial state, the system 
is deterministic. If it starts in A1, then it transitions to B with a reward of 0 and then on
to termination with a reward of 1. If it starts in A2, then it transitions to C, and then to 
termination, with both rewards zero. 

To a learning algorithm, seeing only the features, the system looks identical to the A-split 
example. The system seems to always start in A, followed by either B or C with equal probability, 
and then terminating with a 1 or a 0 depending deterministically on the previous state. As in
the A-split example, the true values of B and C are 1 and 0, and the best shared value of A1 and
A2 is $\frac{1}{2}$, by symmetry.

Because this problem appears externally identical to the A-split example, we already know 
what values will be found by the algorithms. Semi-gradient TD converges to the ideal values 
just mentioned, while the naive residual-gradient algorithm converges to values of $\frac{3}{4}$ and $\frac{1}{4}$ for
B and C respectively. All state transitions are deterministic, so the non-naive residual-gradient 
algorithm will also converge to these values (it is the same algorithm in this case). It follows 
then that this ‘naive’ solution must also be the one that minimizes the BE, and so it is. On a 
deterministic problem, the Bellman errors and TD errors are all the same, so the BE is always 
the same as the TDE. Optimizing the BE on this example gives rise to the same failure mode 
as with the naive residual-gradient algorithm on the A-split example. 
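The values that minimize the BE on this example can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes an undiscounted problem, equal weighting of the four state visits (A1, A2, B, C), and plain gradient descent on the mean squared Bellman error over the three parameters. The function names, step size, and iteration count are assumptions chosen for illustration.

```python
def be_gradient(v_a, v_b, v_c):
    """Gradient of the BE for the A-presplit example (discount = 1).

    The four underlying states are weighted 1/4 each, and every transition
    is deterministic, so each Bellman error equals the TD error.
    """
    d_ab = 0 + v_b - v_a      # A1 -> B, reward 0
    d_ac = 0 + v_c - v_a      # A2 -> C, reward 0
    d_b = 1 + 0 - v_b         # B -> termination, reward 1
    d_c = 0 + 0 - v_c         # C -> termination, reward 0
    # BE = (d_ab^2 + d_ac^2 + d_b^2 + d_c^2) / 4
    g_a = 0.5 * (-d_ab - d_ac)
    g_b = 0.5 * (d_ab - d_b)
    g_c = 0.5 * (d_ac - d_c)
    return g_a, g_b, g_c


def minimize_be(step_size=0.5, iterations=500):
    """Plain gradient descent from zero initial values."""
    v = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        g = be_gradient(*v)
        v = [vi - step_size * gi for vi, gi in zip(v, g)]
    return v


v_a, v_b, v_c = minimize_be()
print(v_a, v_b, v_c)
```

Under these assumptions the descent settles at approximately 1/2 for A, 3/4 for B, and 1/4 for C, which agrees with the values stated in the box rather than with the true values of 1 and 0 for B and C.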


The third way in which the convergence of the residual-gradient algorithm is not satisfactory is 
explained in the next section. Like the second way, the third way is also a problem with the BE 
objective itself rather than with any particular algorithm for achieving it. 


11.6 The Bellman Error is Not Learnable 

The concept of learnability that we introduce in this section is different from that commonly used in 
machine learning. There, a hypothesis is said to be “learnable” if it is efficiently learnable, meaning 
that it can be learned within a polynomial rather than an exponential number of examples. Here we 
use the term in a more basic way, to mean learnable at all, with any amount of experience. It turns out 
many quantities of apparent interest in reinforcement learning cannot be learned even from an infinite 
amount of experiential data. These quantities are well defined and can be computed given knowledge 
of the internal structure of the environment, but cannot be computed or estimated from the observed 
sequence of feature vectors, actions, and rewards.² We say that they are not learnable. It will turn out
that the Bellman error objective (BE) introduced in the last two sections is not learnable in this sense. 
That the Bellman error objective cannot be learned from the observable data is probably the strongest 
reason not to seek it. 

To make the concept of learnability clear, let’s start with some simple examples. Consider the two 
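Markov reward processes³ (MRPs) diagrammed below:

[Diagram: two MRPs. Left: a single state with two self-transitions, labeled with rewards 0 and 2. Right: two states, each of which either stays or switches to the other; one state always yields reward 0 and the other always yields reward 2.]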


Where two edges leave a state, both transitions are assumed to occur with equal probability, and the
numbers indicate the reward received. All the states appear the same; they all produce the same
single-component feature vector $x = 1$ and have approximated value $w$. Thus, the only varying part of the
data trajectory is the reward sequence. The left MRP stays in the same state and emits an endless
stream of 0s and 2s at random, each with 0.5 probability. The right MRP, on every step, either stays
in its current state or switches to the other, with equal probability. The reward is deterministic in this
MRP, always a 0 from one state and always a 2 from the other, but because each state is equally
likely on each step, the observable data is again an endless stream of 0s and 2s at random, identical to
that produced by the left MRP. (We can assume the right MRP starts in one of two states at random
with equal probability.) Thus, even given an infinite amount of data, it would not be possible to
tell which of these two MRPs was generating it. In particular, we could not tell if the MRP has one
state or two, is stochastic or deterministic. These things are not learnable.

² They would of course be estimated if the state sequence were observed rather than only the corresponding feature
vectors.

³ All MRPs can be considered MDPs with a single action in all states; what we conclude about MRPs here applies as
well to MDPs.

This pair of MRPs also illustrates that the VE objective (9.1) is not learnable. If $\gamma = 0$, then the
true values of the three states (in both MRPs), left to right, are 1, 0, and 2. Suppose w = 1. Then the 
VE is 0 for the left MRP and 1 for the right MRP. Because the VE is different in the two problems, 
yet the data generated has the same distribution, the VE cannot be learned. The VE is not a unique 
function of the data distribution. And if it cannot be learned, then how could the VE possibly be useful 
as an objective for learning? 
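The argument can be made concrete with a short enumeration. The sketch below is only an illustration under the assumptions stated in the text ($\gamma = 0$, equal weighting of the two states of the right MRP, and a uniformly random start state); the function names are hypothetical. It computes the exact probability of every reward sequence of a given length under each MRP, and then the VE of each MRP as a function of $w$.

```python
from fractions import Fraction
from itertools import product

HALF = Fraction(1, 2)


def left_reward_sequences(n):
    """Left MRP: one state, reward 0 or 2 with probability 1/2 on each step."""
    return {seq: HALF ** n for seq in product((0, 2), repeat=n)}


def right_reward_sequences(n):
    """Right MRP: two states; one always emits 0, the other always emits 2."""
    probs = {}
    for states in product((0, 1), repeat=n):
        # Uniform start, then each stay-or-switch transition has probability 1/2.
        prob = HALF * HALF ** (n - 1)
        rewards = tuple(2 * s for s in states)
        probs[rewards] = probs.get(rewards, Fraction(0)) + prob
    return probs


def ve_left(w):
    """VE with gamma = 0: the single state has true value 1."""
    return (w - 1) ** 2


def ve_right(w):
    """VE with gamma = 0: true values 0 and 2, each state weighted 1/2."""
    return HALF * (w - 0) ** 2 + HALF * (w - 2) ** 2


same_data = left_reward_sequences(4) == right_reward_sequences(4)
print(same_data, ve_left(1), ve_right(1))
```

The two distributions over length-4 reward sequences are identical, yet at $w = 1$ the VE is 0 for the left MRP and 1 for the right one, as stated above.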

If an objective cannot be learned, it does indeed draw its utility into question. In the case of the 
VE, however, there is a way out. Note that the same solution, w = 1, is optimal for both MRPs above 
(assuming $\mu$ is the same for the two indistinguishable states in the right MRP). Is this a coincidence,
or could it be generally true that all MDPs with the same data distribution also have the same optimal 
parameter vector? If this is true—and we will show next that it is—then the VE remains a usable 
objective. The VE is not learnable, but the parameter that optimizes it is! 

To understand this, it is useful to bring in another natural objective function, this time one that is 
clearly learnable. One error that is always observable is that between the value estimate at each time 
and the return from that time. The Mean Square Return Error, denoted RE, is the expectation, under
$\mu$, of the square of this error. In the on-policy case the RE can be written

$\text{RE}(\mathbf{w}) = \mathbb{E}\big[(G_t - \hat{v}(S_t,\mathbf{w}))^2\big]$

$= \text{VE}(\mathbf{w}) + \mathbb{E}\big[(G_t - v_\pi(S_t))^2\big].$ (11.24)


Thus, the two objectives are the same except for a variance term that does not depend on the parameter 
vector. The two objectives must therefore have the same optimal parameter value w*. The overall 
relationships are summarized in Figure 11.4. 
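As a quick sanity check of (11.24), the decomposition can be evaluated on the two single-feature MRPs introduced earlier in this section. This is a sketch under the same assumptions as the text ($\gamma = 0$, so the return is just the next reward, and equal weighting of the two states of the right MRP); the function names and constants are chosen for this example.

```python
from fractions import Fraction

HALF = Fraction(1, 2)


def re(w):
    """RE from the data alone: rewards are 0 or 2, each with probability 1/2."""
    return HALF * (0 - w) ** 2 + HALF * (2 - w) ** 2


def ve_left(w):
    """Left MRP: one state with true value 1."""
    return (w - 1) ** 2


def ve_right(w):
    """Right MRP: two states with true values 0 and 2, weighted 1/2 each."""
    return HALF * (0 - w) ** 2 + HALF * (2 - w) ** 2


# Variance of the return around the true value, E[(G_t - v_pi(S_t))^2].
VARIANCE_LEFT = HALF * (0 - 1) ** 2 + HALF * (2 - 1) ** 2   # stochastic reward
VARIANCE_RIGHT = Fraction(0)                                 # deterministic reward

for w in (Fraction(-1), Fraction(0), Fraction(1), Fraction(5, 2)):
    assert re(w) == ve_left(w) + VARIANCE_LEFT
    assert re(w) == ve_right(w) + VARIANCE_RIGHT
print(re(Fraction(1)))
```

The RE is the same function of $w$ for both MRPs, while the VE and the variance term differ between them. Because the variance term does not depend on $w$, all three curves share the minimizer $w = 1$.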


[Figure 11.4 diagram; recoverable node label: "Data distribution".]



Figure 11.4: Causal relationships among the data distribution, MDPs, and errors for Monte-Carlo objectives. 
Two different MDPs can produce the same data distribution yet also produce different VEs, proving that the 
VE objective cannot be determined from data and is not learnable. However, all such VEs must have the same 
optimal parameter vector, w*! Moreover, this same w* can be determined from another objective, the RE, 
which is uniquely determined from the data distribution. Thus w* and the RE are learnable even though the 
VEs are not. 

*Exercise 11.3 Prove (11.24). Hint: Write the RE as an expectation over possible states $s$ of the
expectation of the squared error given that $S_t = s$. Then add and subtract the true value of state $s$
from the error (before squaring), grouping the subtracted true value with the return and the added true
value with the estimated value. Then, if you expand the square, the most complex term will end up
being zero, leaving you with (11.24).

[Figure 11.5 diagram; recoverable node label: "Data distribution".]

Figure 11.5: Causal relationships among the data distribution, MDPs, and errors for bootstrapping objectives.
Two different MDPs can produce the same data distribution yet also produce different BEs and have different
minimizing parameter vectors; these are not learnable from the data distribution. The PBE and TDE objectives
and their (different) minima can be directly determined from data and thus are learnable.

Now let us return to the BE. The BE is like the VE in that it can be computed from knowledge of 
the MDP but is not learnable from data. But it is not like the VE in that its minimum solution is not 
learnable. The box on the next page presents a counterexample—two MRPs that generate the same data 
distribution but whose minimizing parameter vector is different, proving that the optimal parameter 
vector is not a function of the data and thus cannot be learned from it. The other bootstrapping 
objectives that we have considered, the PBE and TDE, can be determined from data (are learnable) 
and determine optimal solutions that are in general different from each other and the BE minimums. 
The general case is summarized in Figure 11.5. 

Thus, the BE is not learnable; it cannot be estimated from feature vectors and other observable data. 
This limits the BE to model-based settings. There can be no algorithm that minimizes the BE without 
access to the underlying MDP states beyond the feature vectors. The residual-gradient algorithm is
only able to minimize the BE because it is allowed to double sample from the same state (not a state that
has the same feature vector, but one that is guaranteed to be the same underlying state). We can see
now that there is no way around this. Minimizing the BE requires some such access to the nominal,
underlying MDP. This is an important limitation of the BE beyond that identified in the A-presplit
example on page 222. All this directs more attention toward the PBE.

Counterexample to the learnability of the BE and its minima


To show the full range of possibilities we need a slightly more complex pair of Markov reward 
processes (MRPs) than those considered earlier. Consider the following two MRPs: 



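[Diagram: two MRPs. Left: states A and B. Right: states A, B, and B'. Edge labels give rewards of 0, −1, and 1.]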


Where two edges leave a state, both transitions are assumed to occur with equal probability, 
and the numbers indicate the reward received. The MRP on the left has two states that are 
represented distinctly. The MRP on the right has three states, two of which, B and B', appear 
the same and must be given the same approximate value. Specifically, w has two components 
and the value of state A is given by the first component and the value of B and B' is given by 
the second. The second MRP has been designed so that equal time is spent in all three states, 
so we can take $\mu(s) = \frac{1}{3}$, for all $s$.

Note that the observable data distribution is identical for the two MRPs. In both cases the 
agent will see single occurrences of A followed by a 0, then some number of apparent Bs, each 
followed by a −1 except the last, which is followed by a 1, then we start all over again with
a single A and a 0, etc. All the statistical details are the same as well; in both MRPs, the
probability of a string of $k$ Bs is $2^{-k}$.

Now suppose w = 0. In the first MRP, this is an exact solution, and the BE is zero. In 
the second MRP, this solution produces a squared error in both B and B' of 1, such that 
$\text{BE} = \mu(\text{B}) \cdot 1 + \mu(\text{B}') \cdot 1 = \frac{2}{3}$. These two MRPs, which generate the same data distribution, have
different BEs; the BE is not learnable. 

Moreover (and unlike the earlier example for the VE) the minimizing value of w is different 
for the two MRPs. For the first MRP, $\mathbf{w} = \mathbf{0}$ minimizes the BE for any $\gamma$. For the second MRP,
the minimizing $\mathbf{w}$ is a complicated function of $\gamma$, but in the limit, as $\gamma \to 1$, it is $(-\frac{1}{2}, 0)^\top$. Thus
the solution that minimizes BE cannot be estimated from data alone; knowledge of the MRP 
beyond what is revealed in the data is required. In this sense, it is impossible in principle to 
pursue the BE as an objective for learning. 
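The two BE values quoted for $\mathbf{w} = \mathbf{0}$ can be reproduced with a few lines of code. This is a sketch under explicit assumptions: the transition tables below are reconstructed from the prose description in this box (the diagram itself is not available here), the left MRP's state weighting of 1/3 for A and 2/3 for B is assumed from its expected visit counts, and all names are hypothetical. The choice of discount does not matter at $\mathbf{w} = \mathbf{0}$.

```python
from fractions import Fraction

HALF = Fraction(1, 2)
THIRD = Fraction(1, 3)

# state -> list of (probability, reward, next_state); assumed from the prose.
LEFT = {
    "A": [(1, 0, "B")],
    "B": [(HALF, -1, "B"), (HALF, 1, "A")],
}
RIGHT = {
    "A": [(HALF, 0, "B"), (HALF, 0, "B'")],
    "B": [(1, 1, "A")],
    "B'": [(HALF, -1, "B"), (HALF, -1, "B'")],
}
MU_LEFT = {"A": THIRD, "B": 2 * THIRD}            # assumed state weighting
MU_RIGHT = {"A": THIRD, "B": THIRD, "B'": THIRD}  # equal time in all three states


def bellman_error(mrp, mu, value, gamma):
    """Mean squared Bellman error: sum over s of mu(s) * (expected TD error)^2."""
    total = Fraction(0)
    for s, transitions in mrp.items():
        expected_delta = sum(p * (r + gamma * value[s2] - value[s])
                             for p, r, s2 in transitions)
        total += mu[s] * expected_delta ** 2
    return total


gamma = Fraction(9, 10)
w = (0, 0)   # first component: value of A; second: shared value of B and B'
be_left = bellman_error(LEFT, MU_LEFT, {"A": w[0], "B": w[1]}, gamma)
be_right = bellman_error(RIGHT, MU_RIGHT, {"A": w[0], "B": w[1], "B'": w[1]}, gamma)
print(be_left, be_right)
```

Under these assumptions the left MRP has a BE of 0 and the right MRP has a BE of 2/3, even though both produce the same observable stream of features and rewards.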

It may be surprising that in the second MRP the BE-minimizing value of A is so far from 
zero. Recall that A has a dedicated weight and thus its value is unconstrained by function 
approximation. A is followed by a reward of 0 and transition to a state with a value of nearly 0,
which suggests $v_{\mathbf{w}}(\text{A})$ should be 0; why is its optimal value substantially negative rather than
0? The answer is that making the value of A negative reduces the error upon arriving in A from
B. The reward on this deterministic transition is 1, which implies that B should have a value
1 more than A. Because B’s value is approximately zero, A’s value is driven toward −1. The
BE-minimizing value of $\approx -\frac{1}{2}$ for A is a compromise between reducing the errors on leaving and
on entering A.

11.7 Gradient-TD Methods


We now consider SGD methods for minimizing the PBE. As true SGD methods, these gradient-TD 
methods have robust convergence properties even under off-policy training and non-linear function 
approximation. Remember that in the linear case there is always an exact solution, the TD fixed point 
$\mathbf{w}_{\text{TD}}$, at which the PBE is zero. This solution could be found by least-squares methods (Section 9.7),
but only by methods of quadratic $O(d^2)$ complexity in the number of parameters. We seek instead an
SGD method, which should be $O(d)$ and have robust convergence properties. Gradient-TD methods
come close to achieving these goals, at the cost of a rough doubling of computational complexity. 

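To derive an SGD method for the PBE (assuming linear function approximation) we begin by
expanding and rewriting the objective (11.22) in matrix terms:

$\text{PBE}(\mathbf{w}) = \lVert \Pi \bar{\delta}_{\mathbf{w}} \rVert_\mu^2$

$= (\Pi \bar{\delta}_{\mathbf{w}})^\top \mathbf{D}\, \Pi \bar{\delta}_{\mathbf{w}}$ (from (11.14))

$= \bar{\delta}_{\mathbf{w}}^\top \Pi^\top \mathbf{D}\, \Pi \bar{\delta}_{\mathbf{w}}$

$= \bar{\delta}_{\mathbf{w}}^\top \mathbf{D} \mathbf{X} (\mathbf{X}^\top \mathbf{D} \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{D} \bar{\delta}_{\mathbf{w}}$ (11.25)

(using (11.13) and the identity $\Pi^\top \mathbf{D} \Pi = \mathbf{D} \mathbf{X} (\mathbf{X}^\top \mathbf{D} \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{D}$)

$= (\mathbf{X}^\top \mathbf{D} \bar{\delta}_{\mathbf{w}})^\top (\mathbf{X}^\top \mathbf{D} \mathbf{X})^{-1} (\mathbf{X}^\top \mathbf{D} \bar{\delta}_{\mathbf{w}}).$ (11.26)

The gradient with respect to $\mathbf{w}$ is

$\nabla \text{PBE}(\mathbf{w}) = 2 \nabla \big[\mathbf{X}^\top \mathbf{D} \bar{\delta}_{\mathbf{w}}\big]^\top (\mathbf{X}^\top \mathbf{D} \mathbf{X})^{-1} (\mathbf{X}^\top \mathbf{D} \bar{\delta}_{\mathbf{w}}).$

To turn this into an SGD method, we have to sample something on every time step that has this quantity
as its expected value. Let us take $\mu$ to be the distribution of states visited under the behavior policy.
All three of the factors above can then be written in terms of expectations under this distribution. For
example, the last factor can be written

$\mathbf{X}^\top \mathbf{D} \bar{\delta}_{\mathbf{w}} = \sum_s \mu(s)\, \mathbf{x}(s)\, \bar{\delta}_{\mathbf{w}}(s) = \mathbb{E}[\rho_t \delta_t \mathbf{x}_t],$

which is just the expectation of the semi-gradient TD(0) update (11.2). The first factor is the transpose
of the gradient of this update:

$\nabla \mathbb{E}[\rho_t \delta_t \mathbf{x}_t]^\top = \mathbb{E}\big[\rho_t \nabla \delta_t^\top \mathbf{x}_t^\top\big]$

$= \mathbb{E}\big[\rho_t \nabla (R_{t+1} + \gamma \mathbf{w}^\top \mathbf{x}_{t+1} - \mathbf{w}^\top \mathbf{x}_t)^\top \mathbf{x}_t^\top\big]$ (using episodic $\delta_t$)

$= \mathbb{E}\big[\rho_t (\gamma \mathbf{x}_{t+1} - \mathbf{x}_t) \mathbf{x}_t^\top\big].$

Finally, the middle factor is the inverse of the expected outer-product matrix of the feature vectors:

$\mathbf{X}^\top \mathbf{D} \mathbf{X} = \sum_s \mu(s)\, \mathbf{x}(s) \mathbf{x}(s)^\top = \mathbb{E}\big[\mathbf{x}_t \mathbf{x}_t^\top\big].$

Substituting these expectations for the three factors in our expression for the gradient of the PBE, we
get

$\nabla \text{PBE}(\mathbf{w}) = 2\, \mathbb{E}\big[\rho_t (\gamma \mathbf{x}_{t+1} - \mathbf{x}_t) \mathbf{x}_t^\top\big]\, \mathbb{E}\big[\mathbf{x}_t \mathbf{x}_t^\top\big]^{-1}\, \mathbb{E}[\rho_t \delta_t \mathbf{x}_t].$ (11.27)

It might not be obvious that we have made any progress by writing the gradient in this form. It is a
product of three expressions and the first and last are not independent. They both depend on the next
feature vector $\mathbf{x}_{t+1}$; we cannot simply sample both of these expectations and then multiply the samples.
This would give us a biased estimate of the gradient just as in the naive residual-gradient algorithm.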

Another idea would be to estimate the three expectations separately and then combine them to 
produce an unbiased estimate of the gradient. This would work, but would require a lot of computational 
resources, particularly to store the first two expectations, which are $d \times d$ matrices, and to compute the
inverse of the second. This idea can be improved. If two of the three expectations are estimated and
stored, then the third could be sampled and used in conjunction with the two stored quantities. For
example, you could store estimates of the second two quantities (using the incremental inverse-updating
techniques in Section 9.7) and then sample the first expression. Unfortunately, the overall algorithm
would still be of quadratic complexity (of order $O(d^2)$).

The idea of storing some estimates separately and then combining them with samples is a good one 
and is also used in gradient-TD methods. Gradient-TD methods estimate and store the product of the 
second two factors in (11.27). These factors are a $d \times d$ matrix and a $d$-vector, so their product is just
a $d$-vector, like $\mathbf{w}$ itself. We denote this second learned vector as $\mathbf{v}$:

$\mathbf{v} \approx \mathbb{E}\big[\mathbf{x}_t \mathbf{x}_t^\top\big]^{-1}\, \mathbb{E}[\rho_t \delta_t \mathbf{x}_t].$ (11.28)

This form is familiar to students of linear supervised learning. It is the solution to a linear least-squares
problem that tries to approximate $\rho_t \delta_t$ from the features. The standard SGD method for incrementally
finding the vector $\mathbf{v}$ that minimizes the expected squared error $(\mathbf{v}^\top \mathbf{x}_t - \rho_t \delta_t)^2$ is known as the Least
Mean Square (LMS) rule:

$\mathbf{v}_{t+1} = \mathbf{v}_t + \beta \rho_t (\delta_t - \mathbf{v}_t^\top \mathbf{x}_t)\, \mathbf{x}_t,$

where $\beta > 0$ is another step-size parameter. We can use this method to effectively achieve (11.28) with
$O(d)$ storage and per-step computation.

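Given a stored estimate $\mathbf{v}_t$ approximating (11.28), we can update our main parameter vector $\mathbf{w}_t$
using SGD methods based on (11.27). The simplest such rule is

$\mathbf{w}_{t+1} = \mathbf{w}_t - \frac{1}{2} \alpha \nabla \text{PBE}(\mathbf{w}_t)$ (the general SGD rule)

$= \mathbf{w}_t - \frac{1}{2} \alpha\, 2\, \mathbb{E}\big[\rho_t (\gamma \mathbf{x}_{t+1} - \mathbf{x}_t) \mathbf{x}_t^\top\big]\, \mathbb{E}\big[\mathbf{x}_t \mathbf{x}_t^\top\big]^{-1}\, \mathbb{E}[\rho_t \delta_t \mathbf{x}_t]$ (from (11.27))

$= \mathbf{w}_t + \alpha\, \mathbb{E}\big[\rho_t (\mathbf{x}_t - \gamma \mathbf{x}_{t+1}) \mathbf{x}_t^\top\big]\, \mathbb{E}\big[\mathbf{x}_t \mathbf{x}_t^\top\big]^{-1}\, \mathbb{E}[\rho_t \delta_t \mathbf{x}_t]$ (11.29)

$\approx \mathbf{w}_t + \alpha\, \mathbb{E}\big[\rho_t (\mathbf{x}_t - \gamma \mathbf{x}_{t+1}) \mathbf{x}_t^\top\big]\, \mathbf{v}_t$ (based on (11.28))

$\approx \mathbf{w}_t + \alpha \rho_t (\mathbf{x}_t - \gamma \mathbf{x}_{t+1})\, \mathbf{x}_t^\top \mathbf{v}_t.$ (sampling)

This algorithm is called GTD2. Note that if the final inner product ($\mathbf{x}_t^\top \mathbf{v}_t$) is done first, then the entire
algorithm is of $O(d)$ complexity.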
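A single GTD2 step can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: it assumes linear value estimates, plain Python lists for vectors, and that both parameter vectors are updated from their old values on each step. The function and argument names are assumptions; for a transition into a terminal state, `x_next` would be the all-zero vector.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def gtd2_step(w, v, x, x_next, reward, rho, alpha, beta, gamma):
    """One GTD2 update for linear value estimates w^T x.

    Returns the new (w, v); both are computed from the old w and v.
    """
    delta = reward + gamma * dot(w, x_next) - dot(w, x)   # TD error
    xv = dot(x, v)                  # scalar inner product, computed first: O(d)
    # Secondary (LMS) update: v += beta * rho * (delta - v^T x) * x.
    v_new = [vi + beta * rho * (delta - xv) * xi for vi, xi in zip(v, x)]
    # Primary update: w += alpha * rho * (x - gamma * x_next) * (x^T v).
    w_new = [wi + alpha * rho * (xi - gamma * xni) * xv
             for wi, xi, xni in zip(w, x, x_next)]
    return w_new, v_new


w, v = gtd2_step(w=[0.0, 0.0], v=[0.5, 0.0], x=[1.0, 0.0], x_next=[0.0, 1.0],
                 reward=1.0, rho=1.0, alpha=0.1, beta=0.5, gamma=0.9)
print(w, v)
```

Every operation here is a scalar-vector product or an inner product over $d$ components, so the cost per step is linear in $d$, at roughly twice the cost of semi-gradient TD(0).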

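A slightly better algorithm can be derived by doing a few more analytic steps before substituting in
$\mathbf{v}_t$. Continuing from (11.29):

$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\, \mathbb{E}\big[\rho_t (\mathbf{x}_t - \gamma \mathbf{x}_{t+1}) \mathbf{x}_t^\top\big]\, \mathbb{E}\big[\mathbf{x}_t \mathbf{x}_t^\top\big]^{-1}\, \mathbb{E}[\rho_t \delta_t \mathbf{x}_t]$

$= \mathbf{w}_t + \alpha \big(\mathbb{E}\big[\rho_t \mathbf{x}_t \mathbf{x}_t^\top\big] - \gamma\, \mathbb{E}\big[\rho_t \mathbf{x}_{t+1} \mathbf{x}_t^\top\big]\big)\, \mathbb{E}\big[\mathbf{x}_t \mathbf{x}_t^\top\big]^{-1}\, \mathbb{E}[\rho_t \delta_t \mathbf{x}_t]$

$= \mathbf{w}_t + \alpha \big(\mathbb{E}[\mathbf{x}_t \rho_t \delta_t] - \gamma\, \mathbb{E}\big[\rho_t \mathbf{x}_{t+1} \mathbf{x}_t^\top\big]\, \mathbb{E}\big[\mathbf{x}_t \mathbf{x}_t^\top\big]^{-1}\, \mathbb{E}[\rho_t \delta_t \mathbf{x}_t]\big)$ (using $\mathbb{E}[\rho_t \mathbf{x}_t \mathbf{x}_t^\top] = \mathbb{E}[\mathbf{x}_t \mathbf{x}_t^\top]$)

$\approx \mathbf{w}_t + \alpha \big(\mathbb{E}[\mathbf{x}_t \rho_t \delta_t] - \gamma\, \mathbb{E}\big[\rho_t \mathbf{x}_{t+1} \mathbf{x}_t^\top\big]\, \mathbf{v}_t\big)$ (based on (11.28))

$\approx \mathbf{w}_t + \alpha \rho_t \big(\delta_t \mathbf{x}_t - \gamma \mathbf{x}_{t+1} \mathbf{x}_t^\top \mathbf{v}_t\big),$ (sampling)

which again is $O(d)$ if the final product ($\mathbf{x}_t^\top \mathbf{v}_t$) is done first. This algorithm is known as either TD(0)
with gradient correction (TDC) or, alternatively, as GTD(0).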
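The sampled TDC update can be written as a sketch in the same style. It assumes linear value estimates, Python lists for vectors, and the LMS rule for the secondary vector; the function and argument names are assumptions made for this example, and `x_next` would be all zeros on a terminating transition.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def tdc_step(w, v, x, x_next, reward, rho, alpha, beta, gamma):
    """One TDC (GTD(0)) update for linear value estimates w^T x.

    Returns the new (w, v); both are computed from the old w and v.
    """
    delta = reward + gamma * dot(w, x_next) - dot(w, x)   # TD error
    xv = dot(x, v)                  # scalar inner product, computed first: O(d)
    # Secondary (LMS) update, identical to the one used by GTD2.
    v_new = [vi + beta * rho * (delta - xv) * xi for vi, xi in zip(v, x)]
    # Primary update: semi-gradient TD term minus a gradient-correction term.
    w_new = [wi + alpha * rho * (delta * xi - gamma * xni * xv)
             for wi, xi, xni in zip(w, x, x_next)]
    return w_new, v_new


w, v = tdc_step(w=[0.0, 0.0], v=[0.5, 0.0], x=[1.0, 0.0], x_next=[0.0, 1.0],
                reward=1.0, rho=1.0, alpha=0.1, beta=0.5, gamma=0.9)
print(w, v)
```

The first term in the primary update is the ordinary semi-gradient TD(0) step; the second term is the correction that depends on the secondary vector. If the secondary vector is zero, the step reduces to semi-gradient TD(0).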

Figure 11.6 shows a sample and the expected behavior of TDC on Baird’s counterexample. As 
intended, the PBE falls to zero, but note that the individual components of the parameter vector do 
not approach zero. In fact, these values are still far from an optimal solution, $\hat{v}(s) = 0$, for all $s$, for
which $\mathbf{w}$ would have to be proportional to $(1,1,1,1,1,1,4,-2)^\top$. After 1000 iterations we are still far
from an optimal solution, as we can see from the VE, which remains almost 2. The system is actually 
converging to an optimal solution, but progress is extremely slow because the PBE is already so close 
to zero. 




Figure 11.6: The behavior of the TDC algorithm on Baird’s counterexample. On the left is shown a typical
single run, and on the right is shown the expected behavior of this algorithm if the updates are done synchronously
(analogous to (11.9), except for the two TDC parameter vectors). The step sizes were $\alpha = 0.005$ and $\beta = 0.05$.

GTD2 and TDC both involve two learning processes, a primary one for w and a secondary one for 
v. The logic of the primary learning process relies on the secondary learning process having finished, 
at least approximately, whereas the secondary learning process proceeds without being influenced by 
the first. We call this sort of asymmetrical dependence a cascade. In cascades we often assume that the 
secondary learning process is proceeding faster and thus is always at its asymptotic value, ready and 
accurate to assist the primary learning process. The convergence proofs for these methods often make 
this assumption explicitly. These are called two-time-scale proofs. The fast time scale is that of the 
secondary learning process, and the slower time scale is that of the primary learning process. If $\alpha$ is
the step size of the primary learning process, and $\beta$ is the step size of the secondary learning process,
then these convergence proofs will typically require that in the limit $\beta \to 0$ and $\frac{\alpha}{\beta} \to 0$.
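To make the two limiting conditions concrete, here is one hypothetical pair of decaying step-size schedules that satisfies them. The exponents are an assumption chosen purely for illustration (they are not taken from the text, and the experiment in Figure 11.6 used constant step sizes instead).

```python
def step_sizes(t, alpha_exponent=1.0, beta_exponent=2.0 / 3.0):
    """Return (alpha_t, beta_t) for time step t >= 1.

    Illustrative schedules: alpha_t = t^-1 and beta_t = t^-(2/3), so that
    beta_t -> 0 and alpha_t / beta_t = t^-(1/3) -> 0 as t grows.
    """
    return t ** -alpha_exponent, t ** -beta_exponent


ratios = []
for t in (1, 10, 100, 1000, 10000):
    alpha, beta = step_sizes(t)
    ratios.append(alpha / beta)
    print(t, alpha, beta, alpha / beta)
```

Because the secondary step size shrinks more slowly than the primary one, the secondary process changes quickly relative to the primary process at every stage, which is the sense in which it runs on the faster time scale.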

Gradient-TD methods are currently the most well understood and widely used stable off-policy 
methods. There are extensions to action values and control (GQ, Maei et al., 2010), to eligibility traces 
(GTD(A) and GQ(A), Maei, 2011; Maei and Sutton, 2010), and to nonlinear function approximation 
(Maei et al., 2009). There have also been proposed hybrid algorithms midway between semi-gradient 
TD and gradient TD. The Hybrid TD (HTD, Hackman, 2012; White and White, 2016) algorithm 
behaves like GTD in states where the target and behavior policies are very different, and behaves like 
semi-gradient TD in states where the target and behavior policies are the same. Finally, the
gradient-TD idea has been combined with the ideas of proximal methods and control variates to produce more
efficient methods (Mahadevan et al., 2014).


11.8 Emphatic-TD Methods 

We turn now to the second major strategy that has been extensively explored for obtaining a cheap 
and efficient off-policy learning method with function approximation. Recall that linear semi-gradient 
TD methods are efficient and stable when trained under the on-policy distribution, and that we showed 
in Section 9.4 that this has to do with the positive definiteness of the matrix A (9.11) and the match 
between the on-policy state distribution $\mu_\pi$ and the state-transition probabilities $p(s' \mid s, a)$ under the
target policy. In off-policy learning, we reweight the state transitions using importance weighting so 
that they become appropriate for learning about the target policy, but the state distribution is still 
that of the behavior policy. There is a mismatch. A natural idea is to somehow reweight the states, 
emphasizing some and de-emphasizing others, so as to return the distribution of updates to the
on-policy distribution. There would then be a match, and stability and convergence would follow from
existing results. This is the idea of Emphatic-TD methods, first introduced for on-policy training in 
Section 9.10. 

Actually, the notion of “the on-policy distribution” is not quite right, as there are many on-policy 
distributions, and any one of these is sufficient to guarantee stability. Consider an undiscounted episodic 
problem. The way episodes terminate is fully determined by the transition probabilities, but there may 
be several different ways the episodes might begin. However the episodes start, if all state transitions 
are due to the target policy, then the state distribution that results is an on-policy distribution. You 
might start close to the terminal state and visit only a few states with high probability before ending 
the episode. Or you might start far away and pass through many states before terminating. Both are 
on-policy distributions, and training on both with a linear semi-gradient method would be guaranteed to 
be stable. However the process starts, an on-policy distribution results as long as all states encountered 
are updated up until termination. 

If there is discounting, it can be treated as partial or probabilistic termination for these purposes. 
If $\gamma = 0.9$, then we can consider that with probability 0.1 the process terminates on every time step
and then immediately restarts in the state that is transitioned to. A discounted problem is one that is
continually terminating and restarting with probability $1 - \gamma$ on every step. This way of thinking about
discounting is an example of a more general notion of pseudo termination: termination that does not
affect the sequence of state transitions, but does affect the learning process and the quantities being 
learned. This kind of pseudo termination is important to off-policy learning because the restarting is 
optional—remember we can start any way we want to—and the termination relieves the need to keep 
including encountered states within the on-policy distribution. That is, if we don’t consider the new 
states as restarts, then discounting quickly gives us a limited on-policy distribution.
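The equivalence between discounting and probabilistic termination can be verified exactly on a short reward sequence. The sketch below assumes a finite list of rewards and a process that, after each reward, continues with probability $\gamma$ and otherwise stops collecting rewards; the function names and the example rewards are hypothetical.

```python
from fractions import Fraction


def discounted_return(rewards, gamma):
    """Ordinary discounted sum: R_1 + gamma * R_2 + gamma^2 * R_3 + ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


def expected_return_with_termination(rewards, gamma):
    """Expected undiscounted sum when each step continues with probability gamma."""
    n = len(rewards)
    total = Fraction(0)
    for k in range(1, n + 1):
        # Probability of collecting exactly the first k rewards.
        if k < n:
            prob = gamma ** (k - 1) * (1 - gamma)
        else:
            prob = gamma ** (n - 1)      # survived to the end of the list
        total += prob * sum(rewards[:k])
    return total


rewards = [1, -2, 3, 5]
gamma = Fraction(9, 10)
print(discounted_return(rewards, gamma),
      expected_return_with_termination(rewards, gamma))
```

Both functions return the same value for any reward list and any discount between 0 and 1, since the probability of still being "alive" when the $(k+1)$-th reward arrives is $\gamma^k$.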

The one-step emphatic-TD algorithm for learning episodic state values is defined by: 

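$\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t),$

$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha M_t \rho_t \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t),$

$M_t = \gamma \rho_{t-1} M_{t-1} + I_t,$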

with $I_t$, the interest, being arbitrary and $M_t$, the emphasis, being initialized to $M_{t-1} = 0$. How does this
algorithm perform on Baird’s counterexample? Figure 11.7 shows the trajectory in expectation of the
components of the parameter vector (for the case in which $I_t = 1$, for all $t$). There are some oscillations
but eventually everything converges and the VE goes to zero. These trajectories are obtained by
iteratively computing the expectation of the parameter vector trajectory without any of the variance
due to sampling of transitions and rewards. We do not show the results of applying the emphatic-TD
algorithm directly because its variance on Baird’s counterexample is so high that it is nigh impossible to
get consistent results in computational experiments. The algorithm converges to the optimal solution
in theory on this problem, but in practice it does not. We turn to the topic of reducing the variance of
all these algorithms in the next section.

Figure 11.7: The behavior of the one-step emphatic-TD algorithm in expectation on Baird’s counterexample.
The step size was $\alpha = 0.03$.
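For the linear case, one step of the emphatic-TD update defined above can be sketched as follows. This is an illustration under assumptions: the value estimate is linear, so its gradient is the feature vector, the emphasis recursion uses the previous step's importance sampling ratio, and the emphasis before the first step is taken to be zero. The function and argument names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def emphatic_td_step(w, m_prev, rho_prev, x, x_next, reward, rho,
                     interest, alpha, gamma):
    """One-step emphatic TD with linear v_hat(s, w) = w^T x(s).

    m_prev and rho_prev are the emphasis and ratio from the previous step;
    pass m_prev = 0 on the first step. Returns (w_new, m), with m = M_t.
    """
    m = gamma * rho_prev * m_prev + interest               # emphasis M_t
    delta = reward + gamma * dot(w, x_next) - dot(w, x)    # TD error
    w_new = [wi + alpha * m * rho * delta * xi for wi, xi in zip(w, x)]
    return w_new, m


# First step: no previous emphasis, interest of 1.
w1, m1 = emphatic_td_step(w=[0.0, 0.0], m_prev=0.0, rho_prev=1.0,
                          x=[1.0, 0.0], x_next=[0.0, 1.0], reward=1.0,
                          rho=2.0, interest=1.0, alpha=0.1, gamma=0.9)
# Second step: the emphasis now carries over gamma * rho_{t-1} * M_{t-1}.
w2, m2 = emphatic_td_step(w=w1, m_prev=m1, rho_prev=2.0,
                          x=[0.0, 1.0], x_next=[0.0, 0.0], reward=0.0,
                          rho=1.0, interest=1.0, alpha=0.1, gamma=0.9)
print(w1, m1, w2, m2)
```

The emphasis multiplies the step size, and it accumulates products of past ratios, which gives some intuition for why the sampled algorithm can have very high variance.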


11.9 Reducing Variance 

Off-policy learning is inherently of greater variance than on-policy learning. This is not surprising; if 
you receive data less closely related to a policy, you should expect to learn less about the policy’s values. 
In the extreme, one may be able to learn nothing. You can’t expect to learn how to drive by cooking 
dinner, for example. Only if the target and behavior policies are related, if they visit similar states and 
take similar actions, should one be able to make significant progress in off-policy training. 

On the other hand, any policy has many neighbors, many similar policies with considerable overlap 
in states visited and actions chosen, and yet which are not identical. The raison d’etre of off-policy 
learning is to enable generalization to this vast number of related-but-not-identical policies. The problem 
remains of how to make the best use of the experience. Now that we have some methods that are stable 
in expected value (if the step sizes are set right), attention naturally turns to reducing the variance 
of the estimates. There are many possible ideas, and we can just touch on a few of them in this
introductory text. 

Why is controlling variance especially critical in off-policy methods based on importance sampling? 
As we have seen, importance sampling often involves products of policy ratios. The ratios are always 
one in expectation (5.11), but their actual values may be very high or as low as zero. Successive ratios 
are uncorrelated, so their products are also always one in expected value, but they can be of very high 
variance. Recall that these ratios multiply the step size in SGD methods, so high variance means taking 
steps that vary greatly in their sizes. This is problematic for SGD because of the occasional very large 
steps. They must not be so large as to take the parameter to a part of the space with a very different 
gradient. SGD methods rely on averaging over multiple steps to get a good sense of the gradient, and 
if they make large moves from single samples they become unreliable. If the step-size parameter is set 
small enough to prevent this, then the expected step can end up being very small, resulting in very 
slow learning. The notions of momentum (Derthick, 1984), of Polyak-Ruppert averaging (Polyak, 1991;
Ruppert, 1988; Polyak and Juditsky, 1992), or further extensions of these ideas may significantly help.
Methods for adaptively setting separate step sizes for different components of the parameter vector 
are also pertinent (e.g., Jacobs, 1988; Sutton, 1992), as are the “importance weight aware” updates of 
Karampatziakis and Langford (2010). 
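The claim that products of importance sampling ratios have expectation one but can have very high variance is easy to check exactly on a toy case. The sketch below assumes a hypothetical two-action problem in which the same target and behavior probabilities apply at every step and actions are drawn independently under the behavior policy; the policies and names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

# Hypothetical two-action policies, identical in every state.
TARGET = {"left": Fraction(9, 10), "right": Fraction(1, 10)}
BEHAVIOR = {"left": Fraction(1, 2), "right": Fraction(1, 2)}


def ratio_product_moments(n):
    """Exact mean and variance of the product of n importance sampling ratios."""
    mean = Fraction(0)
    second_moment = Fraction(0)
    for actions in product(BEHAVIOR, repeat=n):
        prob = Fraction(1)
        rho = Fraction(1)
        for a in actions:
            prob *= BEHAVIOR[a]               # probability under the behavior policy
            rho *= TARGET[a] / BEHAVIOR[a]    # per-step ratio
        mean += prob * rho
        second_moment += prob * rho ** 2
    return mean, second_moment - mean ** 2


for n in (1, 5, 10):
    mean, variance = ratio_product_moments(n)
    print(n, mean, float(variance))
```

The mean stays at exactly one for every product length, while the variance grows geometrically with the number of ratios multiplied together.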

In Chapter 5 we saw how weighted importance sampling is significantly better behaved, with lower 
variance updates, than ordinary importance sampling. However, adapting weighted importance
sampling to function approximation is challenging and can probably only be done approximately with $O(d)$
complexity (Mahmood and Sutton, 2015).

The Tree Backup algorithm (Section 7.5) shows that it is possible to perform some off-policy learning 
without using importance sampling. This idea has been extended to the off-policy case to produce 
stable and more efficient methods by Munos, Stepleton, Harutyunyan, and Bellemare (2016) and by 
Mahmood, Yu and Sutton (2017). 

Another, complementary strategy is to allow the target policy to be determined in part by the 
behavior policy, in such a way that it never can be so different from it as to create large importance
sampling ratios. For example, the target policy can be defined by reference to the behavior policy, as 
in the “recognizers” proposed by Precup et al. (2005). 


11.10 Summary 

Off-policy learning is a tempting challenge, testing our ingenuity in designing stable and efficient learning 
algorithms. Tabular Q-learning makes off-policy learning seem easy, and it has natural generalizations 
to Expected Sarsa and to the Tree Backup algorithm. But as we have seen in this chapter, the extension 
of these ideas to significant function approximation, even linear function approximation, involves new 
challenges and forces us to deepen our understanding of reinforcement learning algorithms. 

Why go to such lengths? One reason to seek off-policy algorithms is to give flexibility in dealing 
with the tradeoff between exploration and exploitation. Another is to free behavior from learning, and 
avoid the tyranny of the target policy. TD learning appears to hold out the possibility of learning about 
multiple things in parallel, of using one stream of experience to solve many tasks simultaneously. We 
can certainly do this in special cases, just not in every case that we would like to or as efficiently as we 
would like to. 

In this chapter we divided the challenge of off-policy learning into two parts. The first part, correcting 
the targets of learning for the behavior policy, is straightforwardly dealt with using the techniques 
devised earlier for the tabular case, albeit at the cost of increasing the variance of the updates and 
thereby slowing learning. High variance will probably always remain a challenge for off-policy learning.

The second part of the challenge of off-policy learning emerges as the instability of semi-gradient TD 
methods that involve bootstrapping. We seek powerful function approximation, off-policy learning, and 
the efficiency and flexibility of bootstrapping TD methods, but it is challenging to combine all three 
aspects of this deadly triad in one algorithm without introducing the potential for instability. There 
have been several attempts. The most popular has been to seek to perform true stochastic gradient 
descent (SGD) in the Bellman error (a.k.a. the Bellman residual). However, our analysis concludes 
that this is not an appealing goal in many cases, and that anyway it is impossible to achieve with a 
learning algorithm—the gradient of the BE is not learnable from experience that reveals only feature 
vectors and not underlying states. Another approach, Gradient-TD methods, performs SGD in the 
projected Bellman error. The gradient of the PBE is learnable with O(d) complexity, but at the cost 
of a second parameter vector with a second step size. The newest family of methods, Emphatic-TD
methods, refine an old idea for reweighting updates, emphasizing some and de-emphasizing others. In
this way they restore the special properties that make on-policy learning stable with computationally
simple semi-gradient methods.

The whole area of off-policy learning is relatively new and unsettled. Which methods are best or even
adequate is not yet clear. Are the complexities of the new methods introduced at the end of this chapter
really necessary? Which of them can be combined effectively with variance reduction methods? The
potential for off-policy learning remains tantalizing, the best way to achieve it still a mystery.


Bibliographical and Historical Remarks 

11.1 The first semi-gradient method was linear TD(λ) (Sutton, 1988). The name “semi-gradient” is
more recent (Sutton, 2015a). Semi-gradient off-policy TD(0) with general importance-sampling 
ratio may not have been explicitly stated until Sutton, Mahmood, and White (2016), but the 
action-value forms were introduced by Precup, Sutton, and Singh (2000), who also did eligibility 
trace forms of these algorithms (see Chapter 12). Their continuing, undiscounted forms have 
not been significantly explored. The atomic multi-step forms given here are new. 

11.2 The earliest w-to-2w example was given by Tsitsiklis and Van Roy (1996), who also introduced 
the specific counterexample in the box on page 214. Baird’s counterexample is due to Baird 
(1995), though the version we present here is slightly modified. Averaging methods for function 
approximation were developed by Gordon (1995, 1996). Other examples of instability with off- 
policy DP methods and more complex methods of function approximation are given by Boyan 
and Moore (1995). Bradtke (1993) gives an example in which Q-learning using linear function 
approximation in a linear quadratic regulation problem converges to a destabilizing policy. 

11.3 The deadly triad was first identified by Sutton (1995b) and thoroughly analyzed by Tsitsiklis 
and Van Roy (1997). The name “deadly triad” is due to Sutton (2015a). 

11.4 This kind of linear analysis was pioneered by Tsitsiklis and Van Roy (1996; 1997), including 
the dynamic programming operator. Diagrams like Figure 11.3 were introduced by Lagoudakis 
and Parr (2003). 

11.5 The BE was first proposed as an objective function for dynamic programming by Schweitzer and 
Seidmann (1985). Baird (1995, 1999) extended it to TD learning based on stochastic gradient
descent, and Engel, Mannor, and Meir (2003) extended it to least squares (O(d^2)) methods
known as Gaussian Process TD learning. In the literature, BE minimization is often referred 
to as Bellman residual minimization. 

The earliest A-split example is due to Dayan (1992). The two forms given here were introduced 
by Sutton et al. (2009). 

11.6 The contents of this section are new to this text. 

11.7 Gradient-TD methods were introduced by Sutton, Szepesvari, and Maei (2009). The methods 
highlighted in this section were introduced by Sutton et al. (2009) and Mahmood et al. (2014). 
The most sensitive empirical investigations to date of gradient-TD and related methods are 
given by Geist and Scherrer (2014), Dann, Neumann, and Peters (2014), and White (2015). 

11.8 Emphatic-TD methods were introduced by Sutton, Mahmood, and White (2016). Full convergence
proofs and other theory were later established by Yu (2015a; 2015b; Yu, Mahmood, and
Sutton, 2017) and Hallak, Tamar, Munos, and Mannor (2015).






Chapter 12 


Eligibility Traces 


Eligibility traces are one of the basic mechanisms of reinforcement learning. For example, in the popular
TD(λ) algorithm, the λ refers to the use of an eligibility trace. Almost any temporal-difference (TD)
method, such as Q-learning or Sarsa, can be combined with eligibility traces to obtain a more general
method that may learn more efficiently.

Eligibility traces unify and generalize TD and Monte Carlo methods. When TD methods are augmented
with eligibility traces, they produce a family of methods spanning a spectrum that has Monte
Carlo methods at one end (λ = 1) and one-step TD methods at the other (λ = 0). In between are
intermediate methods that are often better than either extreme method. Eligibility traces also provide
a way of implementing Monte Carlo methods online and on continuing problems without episodes.

Of course, we have already seen one way of unifying TD and Monte Carlo methods: the n-step TD
methods of Chapter 7. What eligibility traces offer beyond these is an elegant algorithmic mechanism
with significant computational advantages. The mechanism is a short-term memory vector, the eligibility
trace $\mathbf{z}_t \in \mathbb{R}^d$, that parallels the long-term weight vector $\mathbf{w}_t \in \mathbb{R}^d$. The rough idea is that when a
component of $\mathbf{w}_t$ participates in producing an estimated value, then the corresponding component of
$\mathbf{z}_t$ is bumped up and then begins to fade away. Learning will then occur in that component of $\mathbf{w}_t$ if
a nonzero TD error occurs before the trace falls back to zero. The trace-decay parameter $\lambda \in [0,1]$
determines the rate at which the trace falls.

The primary computational advantage of eligibility traces over n-step methods is that only a single 
trace vector is required rather than a store of the last n feature vectors. Learning also occurs continually 
and uniformly in time rather than being delayed and then catching up at the end of the episode. In 
addition learning can occur and affect behavior immediately after a state is encountered rather than 
being delayed n steps. 

Eligibility traces illustrate that a learning algorithm can sometimes be implemented in a different way 
to obtain computational advantages. Many algorithms are most naturally formulated and understood 
as an update of a state’s value based on events that follow that state over multiple future time steps. 
For example, Monte Carlo methods (Chapter 5) update a state based on all the future rewards, and 
n-step TD methods (Chapter 7) update based on the next n rewards and state n steps in the future. 
Such formulations, based on looking forward from the updated state, are called forward views. Forward 
views are always somewhat complex to implement because the update depends on later things that are 
not available at the time. However, as we show in this chapter it is often possible to achieve nearly 
the same updates—and sometimes exactly the same updates—with an algorithm that uses the current 
TD error, looking backward to recently visited states using an eligibility trace. These alternate ways of
looking at and implementing learning algorithms are called backward views. Backward views, transformations
between forward views and backward views, and equivalences between them date back to the
introduction of temporal difference learning, but have become much more powerful and sophisticated
since 2014. Here we present the basics of the modern view.

As usual, first we fully develop the ideas for state values and prediction, then extend them to action 
values and control. We develop them first for the on-policy case then extend them to off-policy learning. 
Our treatment pays special attention to the case of linear function approximation, for which the results 
with eligibility traces are stronger. All these results apply also to the tabular and state aggregation 
cases because these are special cases of linear function approximation. 


12.1 The λ-return

In Chapter 7 we defined an n-step return as the sum of the first n rewards plus the estimated value of 
the state reached in n steps, each appropriately discounted (7.1). The general form of that equation, 
for any parameterized function approximator, is 

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} \hat{v}(S_{t+n}, \mathbf{w}_{t+n-1}), \qquad 0 \le t \le T - n. \tag{12.1}$$

We noted in Chapter 7 that each n-step return, for n ≥ 1, is a valid update target for a tabular learning
update, just as it is for an approximate SGD learning update such as (9.7). 

Now we note that a valid update can be done not just toward any n-step return, but toward any 
average of n-step returns. For example, an update can be done toward a target that is half of a two-step 
return and half of a four-step return: $\frac{1}{2} G_{t:t+2} + \frac{1}{2} G_{t:t+4}$. Any set of n-step returns can be averaged
in this way, even an infinite set, as long as the weights on the component returns are positive and 
sum to 1. The composite return possesses an error reduction property similar to that of individual 
n-step returns (7.3) and thus can be used to construct updates with guaranteed convergence properties. 
Averaging produces a substantial new range of algorithms. For example, one could average one-step and 
infinite-step returns to obtain another way of interrelating TD and Monte Carlo methods. In principle, 
one could even average experience-based updates with DP updates to get a simple combination of 
experience-based and model-based methods (cf. Chapter 8). 
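
As a small illustration of averaging n-step returns, the following Python sketch computes a compound target from any set of positive weights that sum to 1. The function names, the list-based episode layout (`rewards[k]` holding $R_{k+1}$ and `values[k]` holding the current estimate for $S_k$), and the numbers are assumptions made for this example, not notation from the text.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return G_{t:t+n} of (12.1), assuming t + n <= len(rewards)."""
    g = sum(gamma ** k * rewards[t + k] for k in range(n))
    return g + gamma ** n * values[t + n]


def compound_return(rewards, values, t, weights, gamma):
    """Average of n-step returns; `weights` maps n to its weight."""
    assert all(w > 0 for w in weights.values())
    assert abs(sum(weights.values()) - 1.0) < 1e-12
    return sum(w * n_step_return(rewards, values, t, n, gamma)
               for n, w in weights.items())


# Hypothetical data: five rewards of 1 and arbitrary value estimates.
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]
values = [0.0, 0.0, 2.0, 0.0, 4.0, 0.0]
# Half of a two-step return and half of a four-step return, as in the text.
target = compound_return(rewards, values, 0, {2: 0.5, 4: 0.5}, gamma=1.0)
```

With these made-up numbers the two-step return is 4 and the four-step return is 8, so the compound target is their equal-weight average.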

An update that averages simpler component updates is called a compound update. The 
backup diagram for a compound update consists of the backup diagrams for each of the 
component updates with a horizontal line above them and the weighting fractions below. 

For example, the compound update for the case mentioned at the start of this section, 
mixing half of a two-step return and half of a four-step return, has the diagram shown 
to the right. A compound update can only be done when the longest of its component 
updates is complete. The update at the right, for example, could only be done at time 
t + 4 for the estimate formed at time t. In general one would like to limit the length of 
the longest component update because of the corresponding delay in the updates. 

The TD(λ) algorithm can be understood as one particular way of averaging n-step
updates. This average contains all the n-step updates, each weighted proportional to
$\lambda^{n-1}$, where $\lambda \in [0,1]$, and is normalized by a factor of $1 - \lambda$ to ensure that the weights
sum to 1 (see Figure 12.1). The resulting update is toward a return, called the λ-return,
defined in its state-based form by

$$G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}. \tag{12.2}$$

Figure 12.1: The update diagram for TD(λ). If λ = 0, then the overall update reduces to its first component,
the one-step TD update, whereas if λ = 1, then the overall update reduces to its last component, the Monte
Carlo update.

Figure 12.2: Weighting given in the λ-return to each of the n-step returns.

Figure 12.2 further illustrates the weighting on the sequence of n-step returns in the λ-return.
The one-step return is given the largest weight, $1 - \lambda$; the two-step return is given
the next largest weight, $(1 - \lambda)\lambda$; the three-step return is given the weight $(1 - \lambda)\lambda^2$; and so
on. The weight fades by λ with each additional step. After a terminal state has been reached, all subsequent
n-step returns are equal to $G_t$. If we want, we can separate these post-termination terms from the main
sum, yielding

$$G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_{t:t+n} + \lambda^{T-t-1} G_t, \tag{12.3}$$

as indicated in the figures. This equation makes it clearer what happens when λ = 1. In this case the
main sum goes to zero, and the remaining term reduces to the conventional return, $G_t$. Thus, for λ = 1,
updating according to the λ-return is a Monte Carlo algorithm. On the other hand, if λ = 0, then the
λ-return reduces to $G_{t:t+1}$, the one-step return. Thus, for λ = 0, updating according to the λ-return is
a one-step TD method.
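
The two limiting cases can be checked numerically. The sketch below evaluates the episodic form (12.3) in plain Python; the function names, the list layout (`rewards[k]` is $R_{k+1}$, `values[k]` is the estimate for $S_k$, and `values[T]` is 0 for the terminal state), and the toy episode are assumptions of this example.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return G_{t:t+n} of (12.1), assuming t + n <= len(rewards)."""
    g = sum(gamma ** k * rewards[t + k] for k in range(n))
    return g + gamma ** n * values[t + n]


def lambda_return(rewards, values, t, gamma, lam):
    """Lambda-return G_t^lambda in the episodic form (12.3)."""
    T = len(rewards)
    # Geometrically weighted n-step returns that stop short of termination.
    head = sum(lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
               for n in range(1, T - t))
    # The (T - t)-step return is the conventional return G_t (values[T] == 0).
    full = n_step_return(rewards, values, t, T - t, gamma)
    return (1 - lam) * head + lam ** (T - t - 1) * full


# Hypothetical three-step episode with a single reward at the end.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]
g_td = lambda_return(rewards, values, 0, gamma=1.0, lam=0.0)   # one-step return
g_mc = lambda_return(rewards, values, 0, gamma=1.0, lam=1.0)   # Monte Carlo return
g_mid = lambda_return(rewards, values, 0, gamma=1.0, lam=0.5)  # in between
```

Here `g_td` equals the one-step return $R_1 + \hat{v}(S_1)$ and `g_mc` equals the full return, matching the two limits described above. The code relies on Python evaluating `0.0 ** 0` as 1.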

Exercise 12.1 Just as the return can be written recursively in terms of the first reward and itself
one-step later (3.9), so can the λ-return. Derive the analogous recursive relationship from (12.2) and
(12.1). □

Exercise 12.2 The parameter λ characterizes how fast the exponential weighting in Figure 12.2 falls
off, and thus how far into the future the λ-return algorithm looks in determining its update. But a
rate factor such as λ is sometimes an awkward way of characterizing the speed of the decay. For some
purposes it is better to specify a time constant, or half-life. What is the equation relating λ and the
half-life, $\tau_\lambda$, the time by which the weighting sequence will have fallen to half of its initial value? □

We are now ready to define our first learning algorithm based on the λ-return: the off-line λ-return
algorithm. As an off-line algorithm, it makes no changes to the weight vector during the episode.
Then, at the end of the episode, a whole sequence of off-line updates is made according to our usual
semi-gradient rule, using the λ-return as the target:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[ G_t^{\lambda} - \hat{v}(S_t, \mathbf{w}_t) \right] \nabla \hat{v}(S_t, \mathbf{w}_t), \qquad t = 0, \ldots, T - 1. \tag{12.4}$$
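
A minimal Python sketch of one episode of this algorithm for the linear case follows. It is one reading of (12.4) under stated assumptions: the λ-return targets are computed from the weights held fixed during the episode, the updates are then applied in sequence, and the terminal state has value 0. The function names and the list-of-lists feature layout are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def offline_lambda_return_episode(w, features, rewards, alpha, gamma, lam):
    """One episode of the off-line lambda-return algorithm (12.4), linear case.

    features[t] is the feature vector of S_t for t = 0..T-1;
    rewards[t] is R_{t+1}. Returns the weights after the end-of-episode updates.
    """
    T = len(rewards)
    # Value estimates under the weights held fixed during the episode.
    v = [dot(w, x) for x in features] + [0.0]

    def n_step(t, n):
        g = sum(gamma ** k * rewards[t + k] for k in range(n))
        return g + gamma ** n * v[t + n]

    # Lambda-return targets in the episodic form (12.3).
    targets = []
    for t in range(T):
        head = sum(lam ** (n - 1) * n_step(t, n) for n in range(1, T - t))
        targets.append((1 - lam) * head + lam ** (T - t - 1) * n_step(t, T - t))

    # Sequence of semi-gradient updates; for linear v-hat the gradient is x(S_t).
    w = list(w)
    for t in range(T):
        error = targets[t] - dot(w, features[t])
        w = [wi + alpha * error * xi for wi, xi in zip(w, features[t])]
    return w
```

For example, with one-hot features `[[1, 0], [0, 1]]`, rewards `[0, 1]`, zero initial weights, `alpha=0.5` and `gamma=1`, setting `lam=1` moves both weights halfway toward the return of 1, while `lam=0` changes only the weight of the last state.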


The λ-return gives us an alternative way of moving smoothly between Monte Carlo and one-step TD
methods that can be compared with the n-step TD way of Chapter 7. There we assessed effectiveness
on a 19-state random walk task (Example 7.1). Figure 12.3 shows the performance of the off-line
λ-return algorithm on this task alongside that of the n-step methods (repeated from Figure 7.2). The
experiment was just as described earlier except that for the λ-return algorithm we varied λ instead of
n. The performance measure used is the estimated root-mean-squared error between the correct and
estimated values of each state measured at the end of the episode, averaged over the first 10 episodes
and the 19 states. Note that overall performance of the off-line λ-return algorithms is comparable to
that of the n-step algorithms. In both cases we get best performance with an intermediate value of the
bootstrapping parameter, n for n-step methods and λ for the off-line λ-return algorithm.

Figure 12.3: 19-state Random walk results (Example 7.1): Performance of the off-line λ-return algorithm
alongside that of the n-step TD methods. In both cases, intermediate values of the bootstrapping parameter (λ
or n) performed best. The results with the off-line λ-return algorithm are slightly better at the best values of
α and λ, and at high α.


The approach that we have been taking so far is what we call the theoretical, or forward, view of a
learning algorithm. For each state visited, we look forward in time to all the future rewards and decide
how best to combine them. We might imagine ourselves riding the stream of states, looking forward
from each state to determine its update, as suggested by Figure 12.4. After looking forward from and
updating one state, we move on to the next and never have to work with the preceding state again.
Future states, on the other hand, are viewed and processed repeatedly, once from each vantage point
preceding them.

Figure 12.4: The forward view. We decide how to update each state by looking forward to future rewards and
states.


12.2 TD(λ)

TD(λ) is one of the oldest and most widely used algorithms in reinforcement learning. It was the first
algorithm for which a formal relationship was shown between a more theoretical forward view and a
more computationally congenial backward view using eligibility traces. Here we will show empirically
that it approximates the off-line λ-return algorithm presented in the previous section.

TD(λ) improves over the off-line λ-return algorithm in three ways. First, it updates the weight vector
on every step of an episode rather than only at the end, and thus its estimates may be better sooner.
Second, its computations are equally distributed in time rather than all at the end of the episode. And
third, it can be applied to continuing problems rather than just episodic problems. In this section we
present the semi-gradient version of TD(λ) with function approximation.

With function approximation, the eligibility trace is a vector $\mathbf{z}_t \in \mathbb{R}^d$ with the same number of
components as the weight vector $\mathbf{w}_t$. Whereas the weight vector is a long-term memory, accumulating
over the lifetime of the system, the eligibility trace is a short-term memory, typically lasting less time 
than the length of an episode. Eligibility traces assist in the learning process; their only consequence is 
that they affect the weight vector, and then the weight vector determines the estimated value. 

In TD(λ), the eligibility trace vector is initialized to zero at the beginning of the episode, is incremented
on each time step by the value gradient, and then fades away by γλ:

$$\mathbf{z}_{-1} = \mathbf{0}, \qquad \mathbf{z}_t = \gamma \lambda \mathbf{z}_{t-1} + \nabla \hat{v}(S_t, \mathbf{w}_t), \quad 0 \le t \le T, \tag{12.5}$$

where γ is the discount rate and λ is the parameter introduced in the previous section. The eligibility
trace keeps track of which components of the weight vector have contributed, positively or negatively,
to recent state valuations, where “recent” is defined in terms of γλ. The trace is said to indicate the
eligibility of each component of the weight vector for undergoing learning changes should a reinforcing
event occur. The reinforcing events we are concerned with are the moment-by-moment one-step TD
errors. The TD error for state-value prediction is

$$\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t). \tag{12.6}$$

In TD(λ), the weight vector is updated on each step proportional to the scalar TD error and the vector
eligibility trace:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \delta_t \mathbf{z}_t. \tag{12.7}$$

Complete pseudocode for TD(λ) is given in the box, and a picture of its operation is suggested by
Figure 12.5.
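
Equations (12.5) to (12.7) translate almost line for line into code. The following Python sketch runs them over one recorded episode for the linear case, where the value gradient is simply the feature vector. It is an illustration under assumptions rather than the boxed pseudocode: the function names and data layout are hypothetical, the episode is given as lists instead of being generated by a policy, and the terminal state is given an all-zero feature vector so that its value is 0.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def td_lambda_episode(w, features, rewards, alpha, gamma, lam):
    """Semi-gradient TD(lambda) over one episode, linear function approximation.

    features[t] is the feature vector of S_t for t = 0..T-1;
    rewards[t] is R_{t+1}. Returns the weights at the end of the episode.
    """
    d = len(w)
    w = list(w)
    z = [0.0] * d                      # eligibility trace, z_{-1} = 0
    T = len(rewards)
    for t in range(T):
        x = features[t]
        # Terminal state: zero features, hence zero estimated value.
        x_next = features[t + 1] if t + 1 < T else [0.0] * d
        # (12.5): decay the trace by gamma * lambda, then add the gradient.
        z = [gamma * lam * zi + xi for zi, xi in zip(z, x)]
        # (12.6): one-step TD error under the current weights.
        delta = rewards[t] + gamma * dot(w, x_next) - dot(w, x)
        # (12.7): move every weight in proportion to its trace.
        w = [wi + alpha * delta * zi for wi, zi in zip(w, z)]
    return w
```

On a two-state episode with one-hot features, a single reward of 1 at the end, `alpha=0.5` and `gamma=1`, `lam=1` credits both states for the final TD error, whereas `lam=0` credits only the last state.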



TD(λ) is oriented backward in time. At each moment we look at the current TD error and assign
it backward to each prior state according to how much that state contributed to the current eligibility 
trace at that time. We might imagine ourselves riding along the stream of states, computing TD errors, 
and shouting them back to the previously visited states, as suggested by Figure 12.5. Where the TD 
error and traces come together, we get the update given by (12.7). 

To better understand the backward view, consider what happens at various values of λ. If λ = 0,
then by (12.5) the trace at t is exactly the value gradient corresponding to $S_t$. Thus the TD(λ) update
(12.7) reduces to the one-step semi-gradient TD update treated in Chapter 9 (and, in the tabular case,
to the simple TD rule (6.2)). This is why that algorithm was called TD(0). In terms of Figure 12.5,
TD(0) is the case in which only the one state preceding the current one is changed by the TD error. For
larger values of λ, but still λ < 1, more of the preceding states are changed, but each more temporally
distant state is changed less because the corresponding eligibility trace is smaller, as suggested by the
figure. We say that the earlier states are given less credit for the TD error.



Figure 12.5: The backward or mechanistic view. Each update depends on the current TD error combined with
the current eligibility traces of past events.

Figure 12.6: 19-state Random walk results (Example 7.1): Performance of TD(λ) alongside that of the off-line
λ-return algorithm. The two algorithms performed virtually identically at low (less than optimal) α values, but
TD(λ) was worse at high α values.


If λ = 1, then the credit given to earlier states falls only by γ per step. This turns out to be just
the right thing to do to achieve Monte Carlo behavior. For example, remember that the TD error, $\delta_t$,
includes an undiscounted term of $R_{t+1}$. In passing this back k steps it needs to be discounted, like any
reward in a return, by $\gamma^k$, which is just what the falling eligibility trace achieves. If λ = 1 and γ = 1,
then the eligibility traces do not decay at all with time. In this case the method behaves like a Monte
Carlo method for an undiscounted, episodic task. If λ = 1, the algorithm is also known as TD(1).

TD(1) is a way of implementing Monte Carlo algorithms that is more general than those presented 
earlier and that significantly increases their range of applicability. Whereas the earlier Monte Carlo 
methods were limited to episodic tasks, TD(1) can be applied to discounted continuing tasks as well. 
Moreover, TD(1) can be performed incrementally and on-line. One disadvantage of Monte Carlo methods
is that they learn nothing from an episode until it is over. For example, if a Monte Carlo control
method takes an action that produces a very poor reward but does not end the episode, then the agent’s 
tendency to repeat the action will be undiminished during the episode. On-line TD(1), on the other 
hand, learns in an n-step TD way from the incomplete ongoing episode, where the n steps are all the 
way up to the current step. If something unusually good or bad happens during an episode, control 
methods based on TD(1) can learn immediately and alter their behavior on that same episode. 

It is revealing to revisit the 19-state random walk example (Example 7.1) to see how well TD(λ)
does in approximating the off-line λ-return algorithm. The results for both algorithms are shown in
Figure 12.6. For each λ value, if α is selected optimally for it (or smaller), then the two algorithms
perform virtually identically. If α is chosen larger than is optimal, however, then the λ-return algorithm
is only a little worse whereas TD(λ) is much worse and may even be unstable. This is not catastrophic
for TD(λ) on this problem, as these higher parameter values are not what one would want to use
anyway, but for other problems it can be a significant weakness.

Linear TD(λ) has been proved to converge in the on-policy case if the step-size parameter is reduced
over time according to the usual conditions (2.7). Just as discussed in Section 9.4, convergence is not
to the minimum-error weight vector, but to a nearby weight vector that depends on λ. The bound on
solution quality presented in that section (9.14) can now be generalized to apply to any λ. For the
continuing discounted case,

$$\overline{\mathrm{VE}}(\mathbf{w}_\infty) \le \frac{1 - \gamma \lambda}{1 - \gamma} \min_{\mathbf{w}} \overline{\mathrm{VE}}(\mathbf{w}). \tag{12.8}$$

That is, the asymptotic error is no more than $\frac{1 - \gamma \lambda}{1 - \gamma}$ times the smallest possible error. As λ approaches
1, the bound approaches the minimum error (and it is loosest at λ = 0). In practice, however, λ = 1 is
often the poorest choice, as will be illustrated later in Figure 12.14.
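
To get a feel for the size of the factor in (12.8), here is a short worked calculation. The discount rate γ = 0.9 and the particular λ values are chosen only for illustration.

```latex
\left.\frac{1-\gamma\lambda}{1-\gamma}\right|_{\gamma = 0.9}:
\qquad
\lambda = 0   \;\Rightarrow\; \frac{1}{0.1}    = 10,
\qquad
\lambda = 0.5 \;\Rightarrow\; \frac{0.55}{0.1} = 5.5,
\qquad
\lambda = 0.9 \;\Rightarrow\; \frac{0.19}{0.1} = 1.9,
\qquad
\lambda = 1   \;\Rightarrow\; \frac{0.1}{0.1}  = 1.
```

The guarantee tightens steadily as λ grows, but it is only an upper bound on asymptotic error and says nothing about learning speed or variance.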

Exercise 12.3 Some insight into how TD(λ) can closely approximate the off-line λ-return algorithm
can be gained by seeing that the latter’s error term (from (12.4)) can be written as the sum of TD
errors (12.6) for a single fixed w. Show this, following the pattern of (6.6), and using the recursive
relationship for the λ-return you obtained in Exercise 12.1. □

*Exercise 12.4 Although online TD(λ) is not equivalent to the λ-return algorithm, perhaps there’s
a slightly different online TD method that would maintain equivalence. One idea is to define the TD
error instead as $\delta'_t = R_{t+1} + \gamma V_t(S_{t+1}) - V_{t-1}(S_t)$. Show that in this case the modified TD(λ) algorithm
would then achieve exactly

$$\Delta V_t(S_t) = \alpha \left[ G_t^{\lambda} - V_{t-1}(S_t) \right],$$

even in the case of on-line updating with large α. In what ways might this modified TD(λ) be better
or worse than the conventional one described in the text? Describe an experiment to assess the relative
merits of the two algorithms. □


12.3 n-step Truncated λ-return Methods

The off-line λ-return algorithm is an important ideal, but it’s of limited utility because it uses the
λ-return (12.2), which is not known until the end of the episode. In the continuing case, the λ-return
is technically never known, as it depends on n-step returns for arbitrarily large n, and thus on rewards
arbitrarily far in the future. However, the dependence gets weaker for long-delayed rewards, falling by
γλ for each step of delay. A natural approximation then would be to truncate the sequence after some
number of steps. Our existing notion of n-step returns provides a natural way to do this in which the
missing rewards are replaced with estimated values.

In general, we define the truncated λ-return for time t, given data only up to some later horizon, h,
as

$$G_{t:h}^{\lambda} = (1 - \lambda) \sum_{n=1}^{h-t-1} \lambda^{n-1} G_{t:t+n} + \lambda^{h-t-1} G_{t:h}, \qquad 0 \le t < h \le T. \tag{12.9}$$

If you compare this equation with the λ-return (12.3), it is clear that the horizon h is playing the same
role as was previously played by T, the time of termination. Whereas in the λ-return there is a residual
weighting given to the true return, here it is given to the longest available n-step return, the (h − t)-step
return (Figure 12.2).
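
The role of the horizon can be seen in a few lines of Python. This sketch evaluates (12.9) directly; the function name, the list layout (`rewards[k]` is $R_{k+1}$, `values[k]` is the estimate used when bootstrapping from $S_k$, with `values[T]` equal to 0), and the toy episode are assumptions of the example.

```python
def truncated_lambda_return(rewards, values, t, h, gamma, lam):
    """Truncated lambda-return G_{t:h}^lambda of (12.9), for 0 <= t < h <= T."""
    def n_step(n):
        # n-step return G_{t:t+n} of (12.1).
        g = sum(gamma ** k * rewards[t + k] for k in range(n))
        return g + gamma ** n * values[t + n]

    head = sum(lam ** (n - 1) * n_step(n) for n in range(1, h - t))
    # Residual weight goes to the longest available return, the (h - t)-step one.
    return (1 - lam) * head + lam ** (h - t - 1) * n_step(h - t)


# Hypothetical three-step episode with a single reward at the end.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]
by_horizon = [truncated_lambda_return(rewards, values, 0, h, 1.0, 0.5)
              for h in (1, 2, 3)]
```

At `h = 1` the result is just the one-step return, and at `h = T` it coincides with the full λ-return of (12.3), so the truncated return interpolates between the two as the horizon advances.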

The truncated λ-return immediately gives rise to a family of n-step λ-return algorithms similar to
the n-step methods of Chapter 7. In all these algorithms, updates are delayed by n steps and only take
into account the first n rewards, but now all the k-step returns are included for 1 ≤ k ≤ n (whereas the
earlier n-step algorithms used only the n-step return), weighted geometrically as in Figure 12.2. In the
state-value case, this family of algorithms is known as truncated TD(λ), or TTD(λ). The compound
backup diagram, shown in Figure 12.7, is similar to that for TD(λ) (Figure 12.1) except that the longest
component update is at most n steps rather than always going all the way to the end of the episode.
TTD(λ) is defined by (cf. (9.15)):

$$\mathbf{w}_{t+n} = \mathbf{w}_{t+n-1} + \alpha \left[ G_{t:t+n}^{\lambda} - \hat{v}(S_t, \mathbf{w}_{t+n-1}) \right] \nabla \hat{v}(S_t, \mathbf{w}_{t+n-1}), \qquad 0 \le t < T. \tag{12.10}$$

Figure 12.7: The backup diagram for truncated TD(λ).

This algorithm can be implemented efficiently so that per-step computation does not scale with n
(though of course memory must). Much as in n-step TD methods, no updates are made on the first
n − 1 time steps, and n − 1 additional updates are made upon termination. Efficient implementation
relies on the fact that the k-step λ-return can be written exactly as

$$G_{t:t+k}^{\lambda} = \hat{v}(S_t, \mathbf{w}_{t-1}) + \sum_{i=t}^{t+k-1} (\gamma \lambda)^{i-t} \delta'_i, \tag{12.11}$$

where

$$\delta'_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_{t-1}). \tag{12.12}$$
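
A quick numerical sanity check of (12.11) is sketched below in Python. It is not a proof, and it makes a simplifying assumption: the value estimates are treated as one fixed array `values`, so the distinction between the weight vectors at different time steps disappears. All names and numbers are hypothetical.

```python
def truncated_lambda_return(rewards, values, t, h, gamma, lam):
    """Left-hand side: G_{t:h}^lambda computed from its definition (12.9)."""
    def n_step(n):
        g = sum(gamma ** k * rewards[t + k] for k in range(n))
        return g + gamma ** n * values[t + n]

    head = sum(lam ** (n - 1) * n_step(n) for n in range(1, h - t))
    return (1 - lam) * head + lam ** (h - t - 1) * n_step(h - t)


def td_error_form(rewards, values, t, k, gamma, lam):
    """Right-hand side of (12.11), assuming fixed value estimates."""
    total = values[t]
    for i in range(t, t + k):
        delta = rewards[i] + gamma * values[i + 1] - values[i]
        total += (gamma * lam) ** (i - t) * delta
    return total


rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]       # values[3] is the terminal state
pairs = [(truncated_lambda_return(rewards, values, 0, k, 0.9, 0.5),
          td_error_form(rewards, values, 0, k, 0.9, 0.5))
         for k in (1, 2, 3)]
```

Each pair in `pairs` agrees to floating-point precision. The sum of TD errors can be extended by one term per time step, which is what lets the per-step cost stay independent of n.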

Exercise 12.5 Several times in this book (often in exercises) we have established that returns can be 
written as sums of TD errors if the value function is held constant. Why is (12.11) another instance of 
this? Prove (12.11). □ 


12.4 Redoing Updates: The Online λ-return Algorithm

Choosing the truncation parameter n in Truncated TD(λ) involves a tradeoff: n should be large so that
the method closely approximates the off-line λ-return algorithm, but it should also be small so that the
updates can be made sooner and can influence behavior sooner. Can we get the best of both? Well,
yes, in principle we can, albeit at the cost of computational complexity.

The idea is that, on each time step as you gather a new increment of data, you go back and redo all 
the updates since the beginning of the current episode. The new updates will be better than the ones 
you previously made because now they can take into account the time step’s new data. That is, the 
updates are always towards an n-step truncated λ-return target, but they always use the latest horizon.
In each pass over that episode you can use a slightly longer horizon and obtain slightly better results.
Recall that the n-step truncated λ-return is defined by

$$G_{t:h}^{\lambda} = (1 - \lambda) \sum_{n=1}^{h-t-1} \lambda^{n-1} G_{t:t+n} + \lambda^{h-t-1} G_{t:h}. \tag{12.9}$$

Let us step through how this target could ideally be used if computational complexity was not an
issue. The episode begins with an estimate at time 0 using the weights $\mathbf{w}_0$ from the end of the previous
episode. Learning begins when the data horizon is extended to time step 1. The target for the estimate
at step 0, given the data up to horizon 1, could only be the one-step return $G_{0:1}$, which includes $R_1$
and bootstraps from the estimate $\hat{v}(S_1, \mathbf{w}_0)$. Note that this is exactly what $G_{0:1}^{\lambda}$ is, with the sum in
the first term of (12.9) degenerating to zero. Using this update target, we construct $\mathbf{w}_1$. Then, after
advancing the data horizon to step 2, what do we do? We have new data in the form of $R_2$ and $S_2$, as
well as the new $\mathbf{w}_1$, so now we can construct a better update target $G_{0:2}^{\lambda}$ for the first update from $S_0$ as
well as a better update target $G_{1:2}^{\lambda}$ for the second update from $S_1$. We perform both of these updates
in sequence to produce $\mathbf{w}_2$. Now we advance the horizon to step 3 and repeat, going all the way back
to produce three new updates and finally $\mathbf{w}_3$, and so on.

This conceptual algorithm involves multiple passes over the episode, one at each horizon, each gen¬ 
erating a different sequence of weight vectors. To describe it clearly we have to distinguish between the 
weight vectors computed at the different horizons. Let us use wj 1 to denote the weights used to generate 
the value at time t in the sequence at horizon h. The first weight vector w() in each sequence is that 
inherited from the previous episode, and the last weight vector in each sequence defines the ultimate 
weight-vector sequence of the algorithm. At the final horizon h = T we obtain the final weights 
which will be passed on to form the initial weights of the next episode. With these conventions, the 
three first sequences described in the previous paragraph can be given explicitly: 

h= 1 : wj=wj + a [Gq :1 - ■O(S'o)Wg)] VD(5 0 , wj), 

h = 2: w 3 = Wq + a [Gq : 2 - D(5 0 ,Wo)] Vv(S 0 , w 3 ), 
w% = wf + a [G 3:2 - D(5i,w 3 )] Vv(Si,Wi), 

h = 3 : w? = + a [Gq :3 - D(5' 0 ,w^)] VD(5 0 ,wg), 

w 2 = wj + a [G 3:3 - D(5i,wf)] Vv(Si,wf), 
w 3 = W 2 + a [ G 2:3 - *>(£ 2 ^ 2 )] Vv(S 2 ,W$). 

The general form for the update is 

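$$\mathbf{w}_{t+1}^h = \mathbf{w}_t^h + \alpha\left[G^\lambda_{t:h} - \hat{v}(S_t,\mathbf{w}_t^h)\right]\nabla\hat{v}(S_t,\mathbf{w}_t^h), \qquad 0 \le t < h \le T. \tag{12.13}$$

This update, together with $\mathbf{w}_t = \mathbf{w}_t^t$, defines the online λ-return algorithm.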
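
The passes over the episode can be sketched in a few lines of Python for the linear case. This is only an illustrative sketch under stated assumptions, not the book's implementation: the function names are hypothetical, the truncated λ-return is assumed to have the form $G^\lambda_{t:h} = (1-\lambda)\sum_{n=1}^{h-t-1}\lambda^{n-1}G_{t:t+n} + \lambda^{h-t-1}G_{t:h}$ from (12.9), the n-step returns are assumed to bootstrap from the diagonal weights $\mathbf{w}_{t+n-1}$, and the terminal state is represented by an all-zero feature vector.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def truncated_lambda_return(t, h, xs, rewards, diag, gamma, lam):
    """G^lambda_{t:h}: mixture of the n-step returns G_{t:t+n}, n = 1..h-t.

    rewards[k] holds R_{k+1}; diag[k] holds the diagonal weights w_k^k (assumed
    to be the weights used for bootstrapping in the n-step returns).
    """
    total = 0.0
    reward_sum = 0.0
    for n in range(1, h - t + 1):
        reward_sum += gamma ** (n - 1) * rewards[t + n - 1]   # adds R_{t+n}
        g_n = reward_sum + gamma ** n * dot(diag[t + n - 1], xs[t + n])
        if n < h - t:
            weight = (1 - lam) * lam ** (n - 1)
        else:
            weight = lam ** (h - t - 1)                        # longest return gets the rest
        total += weight * g_n
    return total


def online_lambda_return(xs, rewards, w0, alpha, gamma, lam):
    """Return the diagonal weights [w_0^0, w_1^1, ..., w_T^T] for one episode."""
    T = len(rewards)
    diag = [list(w0)]
    for h in range(1, T + 1):          # one pass per data horizon
        w = list(w0)                   # each pass restarts from the inherited weights
        for t in range(h):             # update (12.13) for t = 0..h-1
            g = truncated_lambda_return(t, h, xs, rewards, diag, gamma, lam)
            err = g - dot(w, xs[t])
            w = [wi + alpha * err * xi for wi, xi in zip(w, xs[t])]
        diag.append(w)
    return diag
```

For a two-step episode with scalar features `xs = [[1.0], [1.0], [0.0]]` and rewards `[1.0, 2.0]`, calling `online_lambda_return(xs, [1.0, 2.0], [0.0], 0.5, 1.0, 0.5)` walks through exactly the h = 1 and h = 2 sequences written out above. The nested loops make the cost of each horizon grow with the length of the episode so far.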

The online λ-return algorithm is fully online, determining a new weight vector $\mathbf{w}_t$ at each step t during
an episode, using only information available at time t. Its main drawback is that it is computationally
complex, passing over the entire episode so far on every step. Note that it is strictly more complex
than the off-line λ-return algorithm, which passes through all the steps at the time of termination but
does not make any updates during the episode. In return, the online algorithm can be expected to
perform better than the off-line one, not only during the episode when it makes an update while the
off-line algorithm makes none, but also at the end of the episode because the weight vector used in
bootstrapping (in $G^\lambda_{t:h}$) has had a greater number of informative updates. This effect can be seen if one
looks carefully at Figure 12.8, which compares the two algorithms on the 19-state random walk task.

Figure 12.8: 19-state random walk results (Example 7.1): Performance of online and off-line λ-return
algorithms. The performance measure here is the $\overline{\mathrm{VE}}$ at the end of the episode, which should be the best case for
the off-line algorithm. Nevertheless, the online algorithm performs subtly better. For comparison, the λ = 0
line is the same for both methods.


12.5 True Online TD(λ)


The online λ-return algorithm just presented is currently the best performing temporal-difference
algorithm. It is an ideal which online TD(λ) only approximates. As presented, however, the online
λ-return algorithm is very complex. Is there a way to invert this forward-view algorithm to produce
an efficient backward-view algorithm using eligibility traces? It turns out that there is indeed an
exact, computationally congenial implementation of the online λ-return algorithm for the case of linear
function approximation. This implementation is known as the true online TD(λ) algorithm because it
is “truer” to the ideal of the online λ-return algorithm than the TD(λ) algorithm is.

The derivation of true online TD(λ) is a little too complex to present here (see the next section and
the appendix to the paper by van Seijen et al., 2016) but its strategy is simple. The sequence of weight
vectors produced by the online λ-return algorithm can be arranged in a triangle:


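$$
\begin{array}{llllll}
\mathbf{w}_0^0 \\
\mathbf{w}_0^1 & \mathbf{w}_1^1 \\
\mathbf{w}_0^2 & \mathbf{w}_1^2 & \mathbf{w}_2^2 \\
\mathbf{w}_0^3 & \mathbf{w}_1^3 & \mathbf{w}_2^3 & \mathbf{w}_3^3 \\
\vdots & \vdots & \vdots & \vdots & \ddots \\
\mathbf{w}_0^T & \mathbf{w}_1^T & \mathbf{w}_2^T & \mathbf{w}_3^T & \cdots & \mathbf{w}_T^T
\end{array}
\tag{12.14}
$$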


One row of this triangle is produced on each time step. It turns out that only the weight vectors on the
diagonal, the $\mathbf{w}_t^t$, are really needed. The first, $\mathbf{w}_0^0$, is the input, the last, $\mathbf{w}_T^T$, is the output, and each
weight vector along the way, $\mathbf{w}_t^t$, plays a role in bootstrapping in the n-step returns of the updates.
In the final algorithm the diagonal weight vectors are renamed without a superscript, $\mathbf{w}_t = \mathbf{w}_t^t$. The
strategy then is to find a compact, efficient way of computing each $\mathbf{w}_t^t$ from the one before. If this is
done, for the linear case in which $\hat{v}(s,\mathbf{w}) = \mathbf{w}^\top\mathbf{x}(s)$, then we arrive at the true online TD(λ) algorithm:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\delta_t\mathbf{z}_t + \alpha\left(\mathbf{w}_t^\top\mathbf{x}_t - \mathbf{w}_{t-1}^\top\mathbf{x}_t\right)\left(\mathbf{z}_t - \mathbf{x}_t\right), \tag{12.15}$$

where we have used the shorthand $\mathbf{x}_t = \mathbf{x}(S_t)$, $\delta_t$ is defined as in TD(λ) (12.6), and $\mathbf{z}_t$ is defined by

$$\mathbf{z}_t = \gamma\lambda\mathbf{z}_{t-1} + \left(1 - \alpha\gamma\lambda\,\mathbf{z}_{t-1}^\top\mathbf{x}_t\right)\mathbf{x}_t. \tag{12.16}$$

This algorithm has been proven to produce exactly the same sequence of weight vectors, $\mathbf{w}_t$, 0 ≤ t ≤ T,
as the online λ-return algorithm (van Seijen et al., 2016). Thus the results on the random walk task
on the left of Figure 12.8 are also its results on that task. Now, however, the algorithm is much
less expensive. The memory requirements of true online TD(λ) are identical to those of conventional
TD(λ), while the per-step computation is increased by about 50% (there is one more inner product in
the eligibility-trace update). Overall, the per-step computational complexity remains O(d), the same
as TD(λ). Pseudocode for the complete algorithm is given in the box.


True Online TD(λ) for estimating w^T x ≈ v_π

Input: the policy π to be evaluated
Initialize value-function weights w arbitrarily (e.g., w = 0)
Repeat (for each episode):
    Initialize state and obtain initial feature vector x
    z ← 0 (a d-dimensional vector)
    V_old ← 0 (a scalar temporary variable)
    Repeat (for each step of episode):
        Choose A ~ π
        Take action A, observe R, x' (feature vector of the next state)
        V ← w^T x
        V' ← w^T x'
        δ ← R + γV' − V
        z ← γλz + (1 − αγλ z^T x) x
        w ← w + α(δ + V − V_old) z − α(V − V_old) x
        V_old ← V'
        x ← x'
    until x' = 0 (signaling arrival at a terminal state)
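
As an illustration, the box can be transcribed almost line for line into Python. The sketch below is not a reference implementation: the function name and the way an episode is passed in (a list of feature vectors ending with an all-zero terminal vector, plus a list of rewards) are assumptions made for the example, and the temporary variable is carried forward as V' so that it equals $\mathbf{w}_{t-1}^\top\mathbf{x}_t$ in (12.15).

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def true_online_td_lambda(xs, rewards, w0, alpha, gamma, lam):
    """Process one episode and return the weight vector after every step.

    xs[t] is the feature vector of S_t, xs[-1] is all zeros (terminal state),
    and rewards[t] is R_{t+1}.
    """
    w = list(w0)
    z = [0.0] * len(w0)        # dutch trace, (12.16)
    v_old = 0.0                # holds w_{t-1}^T x_t from the previous step
    history = []
    for t, reward in enumerate(rewards):
        x, x_next = xs[t], xs[t + 1]
        v = dot(w, x)
        v_next = dot(w, x_next)
        delta = reward + gamma * v_next - v
        zx = dot(z, x)         # the extra inner product in the trace update
        z = [gamma * lam * zi + (1 - alpha * gamma * lam * zx) * xi
             for zi, xi in zip(z, x)]
        w = [wi + alpha * (delta + v - v_old) * zi - alpha * (v - v_old) * xi
             for wi, zi, xi in zip(w, z, x)]
        v_old = v_next
        history.append(list(w))
    return history
```

With scalar features `[[1.0], [1.0], [0.0]]`, rewards `[1.0, 2.0]`, zero initial weights, α = 0.5, γ = 1 and λ = 0.5, the weights after the two steps come out as 0.5 and 1.5, which is what the update (12.13) produces by hand on the same episode. Each step touches every vector a constant number of times, so the cost per step is O(d).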


The eligibility trace (12.16) used in true online TD(λ) is called a dutch trace to distinguish it from
the trace (12.5) used in TD(λ), which is called an accumulating trace. Earlier work often used a third
kind of trace called the replacing trace, defined only for the tabular case or for binary feature vectors
such as those produced by tile coding. The replacing trace is defined on a component-by-component
basis depending on whether the component of the feature vector was 1 or 0:

$$
z_{i,t} = \begin{cases} 1 & \text{if } x_{i,t} = 1 \\ \gamma\lambda z_{i,t-1} & \text{otherwise.} \end{cases}
\tag{12.17}
$$


Nowadays, use of the replacing trace is deprecated; a dutch trace should almost always be used instead. 
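
The three kinds of trace can be compared side by side for linear function approximation. The following is a small sketch with hypothetical function names; it assumes that the accumulating trace (12.5) reduces to $\mathbf{z}_t = \gamma\lambda\mathbf{z}_{t-1} + \mathbf{x}_t$ in the linear case and that the features are binary for the replacing trace.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def accumulating_trace(z, x, gamma, lam):
    """Assumed linear form of (12.5): decay the trace, then add the features."""
    return [gamma * lam * zi + xi for zi, xi in zip(z, x)]


def dutch_trace(z, x, gamma, lam, alpha):
    """(12.16): like accumulating, but the added features are scaled down."""
    scale = 1 - alpha * gamma * lam * dot(z, x)
    return [gamma * lam * zi + scale * xi for zi, xi in zip(z, x)]


def replacing_trace(z, x, gamma, lam):
    """(12.17): active binary components are reset to 1, others decay."""
    return [1.0 if xi == 1 else gamma * lam * zi for zi, xi in zip(z, x)]
```

Starting from `z = [0.5, 0.2]` with `x = [1, 0]`, γ = 0.9, λ = 0.8 and α = 0.1, the first component becomes 1.36 with the accumulating trace, 1.324 with the dutch trace, and exactly 1 with the replacing trace, while the inactive second component decays to 0.144 in all three cases. In this example the dutch trace falls between the other two.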

12.6 Dutch Traces in Monte Carlo Learning

Although eligibility traces are closely associated historically with TD learning, they are not
specific to it. Eligibility traces arise even in Monte Carlo learning, as we show in this section.
We show that the linear MC algorithm (Chapter 9), taken as a forward view, can be used to derive an
equivalent yet computationally cheaper backward-view algorithm using dutch traces. This is the only
equivalence of forward- and backward-views that we explicitly demonstrate in this book. It gives some
of the flavor of the proof of equivalence of true online TD(λ) and the online λ-return algorithm, but is
much simpler.

The linear version of the gradient Monte Carlo prediction algorithm (page 165) makes the following 
sequence of updates, one for each time step of the episode: 

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\left[G - \mathbf{w}_t^\top\mathbf{x}_t\right]\mathbf{x}_t, \qquad 0 \le t < T. \tag{12.18}$$

To make the example simpler, we assume here that the return G is a single reward received at the end 
of the episode (this is why G is not subscripted by time) and that there is no discounting. In this case 
the update is also known as the least mean square (LMS) rule. As a Monte Carlo algorithm, all the 
updates depend on the final reward/return, so none can be made until the end of the episode. The 
MC algorithm is an offline algorithm and we do not seek to improve this aspect of it. Rather we seek 
merely an implementation of this algorithm with computational advantages. We will still update the 
weight vector only at the end of the episode, but we will do some computation during each step of 
the episode and less at its end. This will give a more equal distribution of computation, O(d) per
step, and also remove the need to store the feature vectors at each step for use later at the end of each
episode. Instead, we will introduce an additional vector memory, the eligibility trace, keeping in it a 
summary of all the feature vectors seen so far. This will be sufficient to efficiently recreate exactly the 
same overall update as the sequence of MC updates (12.18), by the end of the episode: 

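$$
\begin{aligned}
\mathbf{w}_T &= \mathbf{w}_{T-1} + \alpha\left(G - \mathbf{w}_{T-1}^\top\mathbf{x}_{T-1}\right)\mathbf{x}_{T-1} \\
&= \mathbf{w}_{T-1} + \alpha\mathbf{x}_{T-1}\left(-\mathbf{x}_{T-1}^\top\mathbf{w}_{T-1}\right) + \alpha G\mathbf{x}_{T-1} \\
&= \left(\mathbf{I} - \alpha\mathbf{x}_{T-1}\mathbf{x}_{T-1}^\top\right)\mathbf{w}_{T-1} + \alpha G\mathbf{x}_{T-1} \\
&= \mathbf{F}_{T-1}\mathbf{w}_{T-1} + \alpha G\mathbf{x}_{T-1}
\end{aligned}
$$

where $\mathbf{F}_t = \mathbf{I} - \alpha\mathbf{x}_t\mathbf{x}_t^\top$ is a forgetting, or fading, matrix. Now, recursing,

$$
\begin{aligned}
\mathbf{w}_T &= \mathbf{F}_{T-1}\left(\mathbf{F}_{T-2}\mathbf{w}_{T-2} + \alpha G\mathbf{x}_{T-2}\right) + \alpha G\mathbf{x}_{T-1} \\
&= \mathbf{F}_{T-1}\mathbf{F}_{T-2}\mathbf{w}_{T-2} + \alpha G\left(\mathbf{F}_{T-1}\mathbf{x}_{T-2} + \mathbf{x}_{T-1}\right) \\
&= \mathbf{F}_{T-1}\mathbf{F}_{T-2}\left(\mathbf{F}_{T-3}\mathbf{w}_{T-3} + \alpha G\mathbf{x}_{T-3}\right) + \alpha G\left(\mathbf{F}_{T-1}\mathbf{x}_{T-2} + \mathbf{x}_{T-1}\right) \\
&= \mathbf{F}_{T-1}\mathbf{F}_{T-2}\mathbf{F}_{T-3}\mathbf{w}_{T-3} + \alpha G\left(\mathbf{F}_{T-1}\mathbf{F}_{T-2}\mathbf{x}_{T-3} + \mathbf{F}_{T-1}\mathbf{x}_{T-2} + \mathbf{x}_{T-1}\right) \\
&\;\;\vdots \\
&= \underbrace{\mathbf{F}_{T-1}\mathbf{F}_{T-2}\cdots\mathbf{F}_0\mathbf{w}_0}_{\mathbf{a}_{T-1}} + \alpha G\underbrace{\sum_{k=0}^{T-1}\mathbf{F}_{T-1}\mathbf{F}_{T-2}\cdots\mathbf{F}_{k+1}\mathbf{x}_k}_{\mathbf{z}_{T-1}} \\
&= \mathbf{a}_{T-1} + \alpha G\mathbf{z}_{T-1},
\end{aligned}
\tag{12.19}
$$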


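where $\mathbf{a}_{T-1}$ and $\mathbf{z}_{T-1}$ are the values at time T − 1 of two auxiliary memory vectors that can be updated
incrementally without knowledge of G and with O(d) complexity per time step. The $\mathbf{z}_t$ vector is in fact
a dutch-style eligibility trace. It is initialized to $\mathbf{z}_0 = \mathbf{x}_0$ and then updated according to

$$
\begin{aligned}
\mathbf{z}_t &= \sum_{k=0}^{t}\mathbf{F}_t\mathbf{F}_{t-1}\cdots\mathbf{F}_{k+1}\mathbf{x}_k, \qquad 1 \le t < T \\
&= \sum_{k=0}^{t-1}\mathbf{F}_t\mathbf{F}_{t-1}\cdots\mathbf{F}_{k+1}\mathbf{x}_k + \mathbf{x}_t \\
&= \mathbf{F}_t\sum_{k=0}^{t-1}\mathbf{F}_{t-1}\mathbf{F}_{t-2}\cdots\mathbf{F}_{k+1}\mathbf{x}_k + \mathbf{x}_t \\
&= \mathbf{F}_t\mathbf{z}_{t-1} + \mathbf{x}_t \\
&= \left(\mathbf{I} - \alpha\mathbf{x}_t\mathbf{x}_t^\top\right)\mathbf{z}_{t-1} + \mathbf{x}_t \\
&= \mathbf{z}_{t-1} - \alpha\mathbf{x}_t\mathbf{x}_t^\top\mathbf{z}_{t-1} + \mathbf{x}_t \\
&= \mathbf{z}_{t-1} - \alpha\left(\mathbf{z}_{t-1}^\top\mathbf{x}_t\right)\mathbf{x}_t + \mathbf{x}_t \\
&= \mathbf{z}_{t-1} + \left(1 - \alpha\mathbf{z}_{t-1}^\top\mathbf{x}_t\right)\mathbf{x}_t,
\end{aligned}
$$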

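which is the dutch trace for the case of γλ = 1 (cf. Eq. 12.16). The $\mathbf{a}_t$ auxiliary vector is initialized to
$\mathbf{a}_0 = \mathbf{F}_0\mathbf{w}_0$, so that the product in (12.19) includes $\mathbf{F}_0$, and then updated according to

$$\mathbf{a}_t = \mathbf{F}_t\mathbf{F}_{t-1}\cdots\mathbf{F}_0\mathbf{w}_0 = \mathbf{F}_t\mathbf{a}_{t-1} = \mathbf{a}_{t-1} - \alpha\mathbf{x}_t\mathbf{x}_t^\top\mathbf{a}_{t-1}, \qquad 1 \le t < T. \tag{12.20}$$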

The auxiliary vectors, $\mathbf{a}_t$ and $\mathbf{z}_t$, are updated on each time step t < T and then, at time T when G
is observed, they are used in (12.19) to compute $\mathbf{w}_T$. In this way we achieve exactly the same final
result as the MC/LMS algorithm with poor computational properties (12.18), but with an incremental
algorithm whose time and memory complexity per step is O(d). This is surprising and intriguing
because the notion of an eligibility trace (and the dutch trace in particular) has arisen in a setting
without temporal-difference (TD) learning (in contrast to Van Seijen and Sutton, 2014). It seems
eligibility traces are not specific to TD learning at all; they are more fundamental than that. The
need for eligibility traces seems to arise whenever one tries to learn long-term predictions in an efficient
manner.
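
The equivalence derived in this section is easy to check numerically. The sketch below, with hypothetical function names, computes the final weights once by replaying the MC/LMS updates (12.18) over stored feature vectors, and once with the two auxiliary vectors updated step by step as in (12.19) and (12.20). It assumes a single final return G and no discounting, as in the text, and it applies $\mathbf{F}_0$ to the initial weights on the first step so that the product in (12.19) is complete.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def mc_lms_forward(xs, G, w0, alpha):
    """Forward view: apply update (12.18) for t = 0..T-1 once G is known."""
    w = list(w0)
    for x in xs:
        err = G - dot(w, x)
        w = [wi + alpha * err * xi for wi, xi in zip(w, x)]
    return w


def mc_dutch_backward(xs, G, w0, alpha):
    """Backward view: keep only a_t and z_t, then combine them with G at the end."""
    a = list(w0)
    z = [0.0] * len(w0)
    for x in xs:                                   # done during the episode, no G needed
        zx = dot(z, x)
        z = [zi + (1 - alpha * zx) * xi for zi, xi in zip(z, x)]   # z <- F_t z + x_t
        ax = dot(a, x)
        a = [ai - alpha * ax * xi for ai, xi in zip(a, x)]         # a <- F_t a
    return [ai + alpha * G * zi for ai, zi in zip(a, z)]           # w_T = a + alpha G z
```

For scalar features `[[1.0], [2.0]]`, `w0 = [1.0]`, α = 0.1 and G = 3, both functions give a final weight of 1.32. The backward version never stores the feature vectors, only the two d-dimensional vectors.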


12.7 Sarsa(λ)


Very few changes in the ideas already presented in this chapter are required in order to extend eligibility
traces to action-value methods. To learn approximate action values, $\hat{q}(s,a,\mathbf{w})$, rather than approximate
state values, $\hat{v}(s,\mathbf{w})$, we need to use the action-value form of the n-step return, from Chapter 10:

$$G_{t:t+n} = R_{t+1} + \cdots + \gamma^{n-1}R_{t+n} + \gamma^n\hat{q}(S_{t+n},A_{t+n},\mathbf{w}_{t+n-1}), \tag{10.4}$$

for all n and t such that n ≥ 1 and 0 ≤ t < T − n. Using this, we can form the action-value form of the
truncated λ-return, which is otherwise identical to the state-value form (12.9). The action-value form
of the off-line λ-return algorithm (12.4) simply uses $\hat{q}$ rather than $\hat{v}$:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\left[G_t^\lambda - \hat{q}(S_t,A_t,\mathbf{w}_t)\right]\nabla\hat{q}(S_t,A_t,\mathbf{w}_t), \qquad t = 0,\ldots,T-1, \tag{12.21}$$


where $G_t^\lambda = G^\lambda_{t:\infty}$. The compound backup diagram for this forward view is shown in Figure 12.9. Notice
the similarity to the diagram of the TD(λ) algorithm (Figure 12.1). The first update looks ahead one
full step, to the next state-action pair, the second looks ahead two steps, to the second state-action
pair, and so on. A final update is based on the complete return. The weighting of each n-step update
in the λ-return is just as in TD(λ) and the λ-return algorithm (12.3).

Figure 12.9: Sarsa(λ)’s backup diagram. Compare with Figure 12.1.

The temporal-difference method for action values, known as Sarsa(λ), approximates this forward
view. It has the same update rule as given earlier for TD(λ):

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\delta_t\mathbf{z}_t, \tag{12.7}$$

except, naturally, using the action-value form of the TD error:

$$\delta_t = R_{t+1} + \gamma\hat{q}(S_{t+1},A_{t+1},\mathbf{w}_t) - \hat{q}(S_t,A_t,\mathbf{w}_t), \tag{12.22}$$

and the action-value form of the eligibility trace:

$$
\begin{aligned}
\mathbf{z}_{-1} &= \mathbf{0}, \\
\mathbf{z}_t &= \gamma\lambda\mathbf{z}_{t-1} + \nabla\hat{q}(S_t,A_t,\mathbf{w}_t), \qquad 0 \le t \le T
\end{aligned}
\tag{12.23}
$$

(or, alternatively, the replacing trace given by (12.17)). Complete pseudocode for Sarsa(λ) with linear
function approximation, binary features, and either accumulating or replacing traces is given in the box
on the next page. This pseudocode highlights a few optimizations possible in the special case of binary
features (features are either active (=1) or inactive (=0)).
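
A single step of this update can be sketched as follows for linear function approximation, where the gradient of the action-value estimate is just the state-action feature vector. The function name and argument layout are hypothetical, accumulating traces are assumed, and action selection is left to the caller.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def sarsa_lambda_step(w, z, x, x_next, reward, alpha, gamma, lam):
    """One Sarsa(lambda) step with linear q-hat and accumulating traces.

    x is x(S_t, A_t); x_next is x(S_{t+1}, A_{t+1}) (all zeros if terminal).
    Returns the updated (w, z).
    """
    delta = reward + gamma * dot(w, x_next) - dot(w, x)      # TD error (12.22)
    z = [gamma * lam * zi + xi for zi, xi in zip(z, x)]      # trace (12.23)
    w = [wi + alpha * delta * zi for wi, zi in zip(w, z)]    # update (12.7)
    return w, z
```

For example, with `w = [0.5, 1.0]`, a zero trace, `x = [1, 0]`, `x_next = [0, 1]`, reward 1, α = 0.1, γ = 0.9 and λ = 0.8, the TD error is 1.4, the trace becomes `[1, 0]`, and only the first weight moves, to 0.64.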

Example 12.1: Traces in Gridworld The use of eligibility traces can substantially increase the
efficiency of control algorithms over one-step methods and even over n-step methods. The reason for
this is illustrated by the gridworld example below.

The first panel shows the path taken by an agent in a single episode. The initial estimated values were
zero, and all rewards were zero except for a positive reward at the goal location marked by G. The arrows
in the other panels show, for various algorithms, which action values would be increased, and by how
much, upon reaching the goal. A one-step method would increment only the last action value, whereas an
n-step method would equally increment the last n actions’ values, and an eligibility trace method would
update all the action values up to the beginning of the episode to different degrees, fading with recency.
The fading strategy is often the best tradeoff, strongly learning how to reach the goal from the right, yet
not as strongly learning the roundabout path to the goal from the left that was taken in this episode.
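
The pattern of increments described in this example can be reproduced with a few lines of Python. This is a simplified sketch under assumptions that go slightly beyond the text: the tabular case, all initial values zero, a single reward on reaching the goal, every state-action pair on the path visited only once, and accumulating traces. The function name and the `method` strings are hypothetical.

```python
def credit_per_step(T, alpha, gamma, reward, method, n=1, lam=0.0):
    """Increment received by the action value of step t = 0..T-1 of a path that
    reaches the goal at time T, under the assumptions stated above."""
    increments = []
    for t in range(T):
        steps_back = T - 1 - t                  # 0 for the last action before the goal
        if method == "one-step":
            credit = 1.0 if steps_back == 0 else 0.0
        elif method == "n-step":
            credit = gamma ** steps_back if steps_back < n else 0.0
        else:                                   # "trace": eligibility has decayed by gamma*lambda per step
            credit = (gamma * lam) ** steps_back
        increments.append(alpha * reward * credit)
    return increments
```

With a five-step path, α = 0.5, γ = 1 and a goal reward of 1, the one-step method changes only the last action value, the 3-step method changes the last three by the same amount, and the trace method with λ = 0.9 changes all five, with the first action receiving 0.9^4 of the increment given to the last.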


Exercise 12.6 Modify the pseudocode for Sarsa(λ) to use dutch traces (12.16) without the other
features of a true online algorithm. Assume linear function approximation and binary features. □

Example 12.2: Sarsa(λ) on Mountain Car Figure 12.10 (left) shows results with Sarsa(λ) on
the Mountain Car task introduced in Example 10.1. The function approximation, action selection, and
environmental details were exactly as in Chapter 10, and thus it is appropriate to numerically compare
these results with the Chapter 10 results for n-step Sarsa (right side of the figure). The earlier results
varied the update length n whereas here for Sarsa(λ) we vary the trace parameter λ, which plays a
similar role. The fading-trace bootstrapping strategy of Sarsa(λ) appears to result in more efficient
learning on this problem.


[Figure 12.10 plot, two panels for the Mountain Car task: Sarsa(λ) with replacing traces (left) and n-step Sarsa (right). Vertical axis: steps per episode, averaged over the first 50 episodes and 100 runs.]

Figure 12.10: Early performance on the Mountain Car task of Sarsa(λ) with replacing traces and n-step Sarsa
(copied from Figure 10.4) as a function of the step size, α. ■

There is also an action-value version of our ideal TD method, the online λ-return algorithm presented
in Section 12.4. Everything in that section goes through without change other than to use the
action-value form of the n-step return given at the beginning of this section. In the case of linear function
approximation, the ideal algorithm again has an exact, efficient O(d) implementation, called True Online
Sarsa(λ). The analyses in Sections 12.5 and 12.6 carry through without change other than to use
state-action feature vectors $\mathbf{x}_t = \mathbf{x}(S_t,A_t)$ instead of state feature vectors $\mathbf{x}_t = \mathbf{x}(S_t)$. The pseudocode for
this algorithm is given in the box on the next page. Figure 12.11 compares the performance of various
versions of Sarsa(λ) on the Mountain Car example.

[Figure 12.11 plot for the Mountain Car task. Vertical axis: reward per episode, averaged over the first 20 episodes and 100 runs. Surviving curve label: Sarsa(λ) with replacing traces.]

Figure 12.11: Summary comparison of Sarsa(λ) algorithms on the Mountain Car task. True Online Sarsa(λ)
performed better than regular Sarsa(λ) with both accumulating and replacing traces. Also included is a version
of Sarsa(λ) with replacing traces in which, on each time step, the traces for the state and the actions not selected
were set to zero.

12.8 Variable λ and γ

We are starting now to reach the end of our development of fundamental TD learning algorithms.
To present the final algorithms in their most general forms, it is useful to generalize the degree of
bootstrapping and discounting beyond constant parameters to functions potentially dependent on the
state and action. That is, each time step will have a different λ and γ, denoted $\lambda_t$ and $\gamma_t$. We change
notation now so that $\lambda : \mathcal{S} \times \mathcal{A} \to [0,1]$ is now a whole function from states and actions to the unit
interval such that $\lambda_t = \lambda(S_t,A_t)$, and similarly, $\gamma : \mathcal{S} \to [0,1]$ is a function from states to the unit
interval such that $\gamma_t = \gamma(S_t)$.

The latter generalization, to state-dependent discounting, is particularly significant because it changes
the return, the fundamental random variable whose expectation we seek to estimate. Now the return
is defined more generally as

$$
\begin{aligned}
G_t &= R_{t+1} + \gamma_{t+1}G_{t+1} \\
&= R_{t+1} + \gamma_{t+1}R_{t+2} + \gamma_{t+1}\gamma_{t+2}R_{t+3} + \gamma_{t+1}\gamma_{t+2}\gamma_{t+3}R_{t+4} + \cdots \\
&= \sum_{k=t}^{\infty}R_{k+1}\prod_{i=t+1}^{k}\gamma_i,
\end{aligned}
\tag{12.24}
$$

where, to assure the sums are finite, we require that $\prod_{k=t}^{\infty}\gamma_k = 0$ with probability one for all t. One
convenient aspect of this definition is that it allows us to dispense with episodes, start and terminal
states, and T as special cases and quantities. A terminal state just becomes a state at which γ(s) = 0
and which transitions to the start state. In that way (and by choosing γ(·) to be constant at all other
states) we can recover the classical episodic setting as a special case. State-dependent discounting includes other
prediction cases such as soft termination, when we seek to predict a quantity that becomes complete
but does not alter the flow of the Markov process. Discounted returns themselves can be thought of as
such a quantity, and state-dependent discounting is a deep unification of the episodic and
discounted-continuing cases. (The undiscounted-continuing case still needs some special treatment.)
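
The generalized return (12.24) can be computed either from its recursive form or from its sum-of-products form. The following sketch, with hypothetical function names, shows both for a finite stretch of experience and assumes the return beyond the end of the supplied lists is zero (for example, because the last discount is zero).

```python
def general_return(rewards, gammas):
    """G_t via the recursion G_t = R_{t+1} + gamma_{t+1} G_{t+1}.

    rewards[k] is R_{t+k+1} and gammas[k] is gamma_{t+k+1}.
    """
    g = 0.0
    for reward, gamma in zip(reversed(rewards), reversed(gammas)):
        g = reward + gamma * g
    return g


def general_return_product_form(rewards, gammas):
    """The same return as a sum of rewards weighted by running products of discounts."""
    total, discount = 0.0, 1.0
    for reward, gamma in zip(rewards, gammas):
        total += discount * reward     # discount is the product of gamma_{t+1}..gamma_k
        discount *= gamma
    return total
```

With rewards `[1, 2, 3]` and discounts `[0.5, 0.0, 0.9]`, both functions return 2.0: the zero discount at the second successor state acts like a termination, so the third reward does not contribute to the return from time t.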

The generalization to variable bootstrapping is not a change in the problem, like discounting, but a
change in the solution strategy. The generalization affects the λ-returns for states and actions. The
new state-based λ-return can be written recursively as

$$G_t^{\lambda s} = R_{t+1} + \gamma_{t+1}\left((1-\lambda_{t+1})\hat{v}(S_{t+1},\mathbf{w}_t) + \lambda_{t+1}G_{t+1}^{\lambda s}\right), \tag{12.25}$$

where now we have added the “s” to the superscript λ to remind us that this is a return that bootstraps
from state values, distinguishing it from returns that bootstrap from action values, which we present
below with “a” in the superscript. This equation says that the λ-return is the first reward, undiscounted
and unaffected by bootstrapping, plus possibly a second term to the extent that we are not discounting
at the next state (that is, according to $\gamma_{t+1}$; recall that this is zero if the next state is terminal). To the
extent that we aren’t terminating at the next state, we have a second term which is itself divided into
two cases depending on the degree of bootstrapping in the state. To the extent we are bootstrapping,
this term is the estimated value at the state, whereas, to the extent that we are not bootstrapping, the
term is the λ-return for the next time step. The action-based λ-return is either the Sarsa form

$$G_t^{\lambda a} = R_{t+1} + \gamma_{t+1}\left((1-\lambda_{t+1})\hat{q}(S_{t+1},A_{t+1},\mathbf{w}_t) + \lambda_{t+1}G_{t+1}^{\lambda a}\right), \tag{12.26}$$

or the Expected Sarsa form,

$$G_t^{\lambda a} = R_{t+1} + \gamma_{t+1}\left((1-\lambda_{t+1})\bar{Q}_{t+1} + \lambda_{t+1}G_{t+1}^{\lambda a}\right), \tag{12.27}$$

where

$$\bar{Q}_t = \sum_a \pi(a|S_t)\,\hat{q}(S_t,a,\mathbf{w}_{t-1}). \tag{12.28}$$
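
To make the recursion concrete, the state-based form (12.25) can be unrolled backward over a finite trajectory. The sketch below uses a hypothetical function name and assumes that the value estimates are held fixed and that the trajectory ends in a state whose discount is zero.

```python
def lambda_return_state(rewards, values, gammas, lambdas):
    """G_0^{lambda s} from the recursion (12.25), computed backward in time.

    Index k refers to time k: rewards[k] is R_{k+1}, values[k] is the estimate
    for S_k, gammas[k] is gamma_k and lambdas[k] is lambda_k. The lists values,
    gammas and lambdas have one more entry than rewards.
    """
    T = len(rewards)
    g = values[T]          # has no effect when gammas[T] == 0 (termination)
    for t in range(T - 1, -1, -1):
        bootstrap = (1 - lambdas[t + 1]) * values[t + 1]
        g = rewards[t] + gammas[t + 1] * (bootstrap + lambdas[t + 1] * g)
    return g
```

Take rewards `[1, 1]`, value estimates `[0.3, 0.7, 0.0]`, and discounts `[1, 0.9, 0]`. With every λ equal to 0 the result is the one-step target 1 + 0.9 × 0.7 = 1.63, with every λ equal to 1 it is the full return 1.9, and with λ = 0.5 it is 1.765, halfway between the two.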

Exercise 12.7 Generalize the three recursive equations above to their truncated versions, defining
$G^{\lambda s}_{t:h}$ and $G^{\lambda a}_{t:h}$. □


12.9 Off-policy Eligibility Traces 


The final step is to incorporate importance sampling. Unlike in the case of n-step methods, for full
non-truncated λ-returns one does not have a practical option in which the importance sampling is done
outside the target return. Instead, we move directly to the bootstrapping generalization of per-reward
importance sampling (Section 7.4). In the state case, our final definition of the λ-return generalizes
(12.25), after the model of (7.10), to

$$G_t^{\lambda s} = \rho_t\left(R_{t+1} + \gamma_{t+1}\left((1-\lambda_{t+1})\hat{v}(S_{t+1},\mathbf{w}_t) + \lambda_{t+1}G_{t+1}^{\lambda s}\right)\right) + (1-\rho_t)\hat{v}(S_t,\mathbf{w}_t) \tag{12.29}$$

where $\rho_t = \frac{\pi(A_t|S_t)}{b(A_t|S_t)}$ is the usual single-step importance sampling ratio. Much like the other returns we
have seen in this book, the truncated version of this return can be approximated simply in terms of
sums of the state-based TD error,

$$\delta_t^s = R_{t+1} + \gamma_{t+1}\hat{v}(S_{t+1},\mathbf{w}_t) - \hat{v}(S_t,\mathbf{w}_t), \tag{12.30}$$

as

$$G_t^{\lambda s} \approx \hat{v}(S_t,\mathbf{w}_t) + \rho_t\sum_{k=t}^{\infty}\delta_k^s\prod_{i=t+1}^{k}\gamma_i\lambda_i\rho_i \tag{12.31}$$

with the approximation becoming exact if the approximate value function does not change.
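
The relationship between the recursive definition (12.29) and the TD-error form (12.31) can be checked numerically when the value estimates are held fixed. The sketch below is an illustration only, not a proof; the function names are hypothetical, and it assumes a finite trajectory that ends in a state with zero discount so that the infinite sum terminates.

```python
def off_policy_lambda_return(rewards, values, gammas, lambdas, rhos):
    """G_0^{lambda s} from the recursion (12.29) with fixed value estimates.

    Index k refers to time k; values, gammas and lambdas have T+1 entries,
    rewards (rewards[k] is R_{k+1}) and rhos have T entries, and gammas[T] is 0.
    """
    T = len(rewards)
    g = values[T]
    for t in range(T - 1, -1, -1):
        inner = (1 - lambdas[t + 1]) * values[t + 1] + lambdas[t + 1] * g
        g = rhos[t] * (rewards[t] + gammas[t + 1] * inner) + (1 - rhos[t]) * values[t]
    return g


def td_error_sum_form(rewards, values, gammas, lambdas, rhos):
    """Right-hand side of (12.31) for t = 0 with the same fixed value estimates."""
    T = len(rewards)
    total, product = 0.0, 1.0
    for k in range(T):
        if k > 0:
            product *= gammas[k] * lambdas[k] * rhos[k]
        delta = rewards[k] + gammas[k + 1] * values[k + 1] - values[k]   # (12.30)
        total += delta * product
    return values[0] + rhos[0] * total
```

For rewards `[1, 2]`, values `[0.5, 0.2, 0]`, discounts `[1, 0.9, 0]`, bootstrapping parameters `[1, 0.8, 1]` and importance sampling ratios `[1.5, 0.5]`, both functions give 2.492.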

Exercise 12.8 Prove that (12.31) becomes exact if the value function does not change. To save writing,
consider the case of t = 0, and use the notation $V_k = \hat{v}(S_k,\mathbf{w})$. □

Exercise 12.9 The truncated version of the general off-policy return is denoted $G^{\lambda s}_{t:h}$. Guess the correct
equation, based on (12.31). □

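The above form of the λ-return (12.31) is convenient to use in a forward-view update,

$$
\begin{aligned}
\mathbf{w}_{t+1} &= \mathbf{w}_t + \alpha\left(G_t^{\lambda s} - \hat{v}(S_t,\mathbf{w}_t)\right)\nabla\hat{v}(S_t,\mathbf{w}_t) \\
&\approx \mathbf{w}_t + \alpha\rho_t\left(\sum_{k=t}^{\infty}\delta_k^s\prod_{i=t+1}^{k}\gamma_i\lambda_i\rho_i\right)\nabla\hat{v}(S_t,\mathbf{w}_t),
\end{aligned}
$$

which to the experienced eye looks like an eligibility-based TD update: the product is like an eligibility
trace and it is multiplied by TD errors. But this is just one time step of a forward view. The relationship
that we are looking for is that the forward-view update, summed over time, is approximately equal to
a backward-view update, summed over time (this relationship is only approximate because again we
ignore changes in the value function). The sum of the forward-view update over time is

$$
\begin{aligned}
\sum_{t=1}^{\infty}\left(\mathbf{w}_{t+1} - \mathbf{w}_t\right) &\approx \sum_{t=1}^{\infty}\sum_{k=t}^{\infty}\alpha\rho_t\delta_k^s\nabla\hat{v}(S_t,\mathbf{w}_t)\prod_{i=t+1}^{k}\gamma_i\lambda_i\rho_i \\
&= \sum_{k=1}^{\infty}\sum_{t=1}^{k}\alpha\rho_t\nabla\hat{v}(S_t,\mathbf{w}_t)\delta_k^s\prod_{i=t+1}^{k}\gamma_i\lambda_i\rho_i \\
&\qquad\text{(using the summation rule: } \textstyle\sum_{t=x}^{y}\sum_{k=t}^{y} = \sum_{k=x}^{y}\sum_{t=x}^{k}\text{)} \\
&= \sum_{k=1}^{\infty}\alpha\delta_k^s\sum_{t=1}^{k}\rho_t\nabla\hat{v}(S_t,\mathbf{w}_t)\prod_{i=t+1}^{k}\gamma_i\lambda_i\rho_i,
\end{aligned}
$$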
fc=1 t =1 i=t+1 
which would be in the form of the sum of a backward-view TD update if the entire expression from the
second sum left could be written and updated incrementally as an eligibility trace, which we now show
can be done. That is, we show that if this expression was the trace at time k, then we could update it
from its value at time k − 1 by:

$$
\begin{aligned}
\mathbf{z}_k &= \sum_{t=1}^{k}\rho_t\nabla\hat{v}(S_t,\mathbf{w}_t)\prod_{i=t+1}^{k}\gamma_i\lambda_i\rho_i \\
&= \sum_{t=1}^{k-1}\rho_t\nabla\hat{v}(S_t,\mathbf{w}_t)\prod_{i=t+1}^{k}\gamma_i\lambda_i\rho_i + \rho_k\nabla\hat{v}(S_k,\mathbf{w}_k) \\
&= \gamma_k\lambda_k\rho_k\underbrace{\sum_{t=1}^{k-1}\rho_t\nabla\hat{v}(S_t,\mathbf{w}_t)\prod_{i=t+1}^{k-1}\gamma_i\lambda_i\rho_i}_{\mathbf{z}_{k-1}} + \rho_k\nabla\hat{v}(S_k,\mathbf{w}_k) \\
&= \rho_k\left(\gamma_k\lambda_k\mathbf{z}_{k-1} + \nabla\hat{v}(S_k,\mathbf{w}_k)\right),
\end{aligned}
$$

which, changing the index from k to t, is the general accumulating trace update for state values:

$$\mathbf{z}_t = \rho_t\left(\gamma_t\lambda_t\mathbf{z}_{t-1} + \nabla\hat{v}(S_t,\mathbf{w}_t)\right). \tag{12.32}$$

This eligibility trace, together with the usual semi-gradient parameter-update rule for TD(λ) (12.7),
forms a general TD(λ) algorithm that can be applied to either on-policy or off-policy data. In the
on-policy case, the algorithm is exactly TD(λ) because $\rho_t$ is always 1 and (12.32) becomes the usual
accumulating trace (12.5) (extended to variable λ and γ). In the off-policy case, the algorithm often
works well but, as a semi-gradient method, is not guaranteed to be stable. In the next few sections
we will consider extensions of it that do guarantee stability.
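
One step of this general TD(λ) algorithm might look as follows for linear function approximation. The sketch is illustrative: the function name and argument list are hypothetical, and the caller is assumed to supply the per-step quantities $\rho_t$, $\gamma_t$, $\lambda_t$ and $\gamma_{t+1}$.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def off_policy_td_lambda_step(w, z, x, x_next, reward,
                              rho, gamma, lam, gamma_next, alpha):
    """One step of general TD(lambda) with linear v-hat.

    rho, gamma and lam are rho_t, gamma_t and lambda_t; gamma_next is
    gamma_{t+1}. Returns the updated (w, z).
    """
    delta = reward + gamma_next * dot(w, x_next) - dot(w, x)        # (12.30)
    z = [rho * (gamma * lam * zi + xi) for zi, xi in zip(z, x)]     # (12.32)
    w = [wi + alpha * delta * zi for wi, zi in zip(w, z)]           # (12.7)
    return w, z
```

With `rho = 1` on every step the trace update is the ordinary accumulating trace, which is the on-policy case described above. A ratio of zero wipes the trace entirely, so nothing that happened before an action the target policy would never take receives further credit.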

A very similar series of steps can be followed to derive the off-policy eligibility traces for action-value
methods and corresponding general Sarsa(λ) algorithms. One could start with either recursive form
for the general action-based λ-return, (12.26) or (12.27), but the former works out to be simpler. We
extend (12.26) to the off-policy case after the model of (7.11) to produce

$$
\begin{aligned}
G_t^{\lambda a} &= R_{t+1} + \gamma_{t+1}\Big((1-\lambda_{t+1})\left(\rho_{t+1}\hat{q}(S_{t+1},A_{t+1},\mathbf{w}_t) + (1-\rho_{t+1})\bar{Q}_{t+1}\right) \\
&\qquad\qquad\qquad + \lambda_{t+1}\left(\rho_{t+1}G_{t+1}^{\lambda a} + (1-\rho_{t+1})\bar{Q}_{t+1}\right)\Big) \\
&= R_{t+1} + \gamma_{t+1}\left((1-\lambda_{t+1})\rho_{t+1}\hat{q}(S_{t+1},A_{t+1},\mathbf{w}_t) + \lambda_{t+1}\rho_{t+1}G_{t+1}^{\lambda a} + (1-\rho_{t+1})\bar{Q}_{t+1}\right),
\end{aligned}
\tag{12.33}
$$

where $\bar{Q}_{t+1}$ is as given by (12.28). Again the λ-return can be written approximately as the sum of TD
errors,

$$G_t^{\lambda a} \approx \hat{q}(S_t,A_t,\mathbf{w}_t) + \sum_{k=t}^{\infty}\delta_k^a\prod_{i=t+1}^{k}\gamma_i\lambda_i\rho_i, \tag{12.34}$$

using a novel form of the TD error:

$$\delta_t^a = R_{t+1} + \gamma_{t+1}\left(\rho_{t+1}\hat{q}(S_{t+1},A_{t+1},\mathbf{w}_t) + (1-\rho_{t+1})\bar{Q}_{t+1}\right) - \hat{q}(S_t,A_t,\mathbf{w}_t). \tag{12.35}$$

As before, the approximation becomes exact if the approximate value function does not change.

Exercise 12.10 Prove that (12.34) becomes exact if the value function does not change. To save
writing, consider the case of t = 0, and use the notation $Q_k = \hat{q}(S_k,A_k,\mathbf{w})$. Hint: Start by writing out
$\delta_0^a$ and $G_0^{\lambda a}$, then $G_0^{\lambda a} - Q_0$. □

Exercise 12.11 The truncated version of the general off-policy return is denoted $G^{\lambda a}_{t:h}$. Guess the
correct equation, based on (12.34). □

Using steps entirely analogous to those for the state case, one can write a forward-view update based
on (12.34), transform the sum of the updates using the summation rule, and finally derive the following
form for the eligibility trace for action values:

$$\mathbf{z}_t = \gamma_t\lambda_t\rho_t\mathbf{z}_{t-1} + \nabla\hat{q}(S_t,A_t,\mathbf{w}_t). \tag{12.36}$$

This eligibility trace, together with the usual semi-gradient parameter-update rule (12.7), forms a
general Sarsa(λ) algorithm that can be applied to either on-policy or off-policy data. In the on-policy
case with constant λ and γ this algorithm is identical to the Sarsa(λ) algorithm presented in Section 12.7.
In the off-policy case this algorithm is not stable unless combined with one of the methods presented
in the following sections.
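
A one-step sketch of this general Sarsa(λ) algorithm for the linear case is given below. The names are hypothetical. The only assumption beyond the equations is that, with linear action values, the expectation $\bar{Q}_{t+1}$ of (12.28) can be computed as the inner product of the weights with the policy-averaged feature vector of the next state, which the caller passes in as `xbar_next`.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def general_sarsa_lambda_step(w, z, x, x_next, xbar_next, reward,
                              rho, rho_next, gamma, lam, gamma_next, alpha):
    """One step of general (off-policy) Sarsa(lambda) with linear q-hat.

    x = x(S_t, A_t), x_next = x(S_{t+1}, A_{t+1}), and xbar_next is the average
    of x(S_{t+1}, a) under the target policy. rho, gamma, lam are for time t;
    rho_next and gamma_next are for time t+1. Returns the updated (w, z).
    """
    q_next = dot(w, x_next)
    q_bar_next = dot(w, xbar_next)                                   # (12.28), linear case
    target = rho_next * q_next + (1 - rho_next) * q_bar_next
    delta = reward + gamma_next * target - dot(w, x)                 # (12.35)
    z = [gamma * lam * rho * zi + xi for zi, xi in zip(z, x)]        # (12.36)
    w = [wi + alpha * delta * zi for wi, zi in zip(w, z)]            # (12.7)
    return w, z
```

Setting both ratios to 1 removes the expectation term from the TD error and leaves the Sarsa(λ) update of Section 12.7, which matches the on-policy statement above.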

Exercise 12.12 Show in detail the steps outlined above for deriving (12.36) from (12.34). Start with
the update (12.21), substitute $G_t^{\lambda a}$ from (12.33) for $G_t^\lambda$, then follow similar steps as led to (12.32). □

*Exercise 12.13 Show how similar steps can be followed starting from the Expected Sarsa form of the
action-based λ-return (12.27) and (12.28) to derive the same eligibility trace algorithm as (12.36), but
with a different TD error:

$$\delta_t^a = R_{t+1} + \gamma_{t+1}\bar{Q}_{t+1} - \hat{q}(S_t,A_t,\mathbf{w}_t) + \gamma_{t+1}\lambda_{t+1}\rho_{t+1}\left(\hat{q}(S_{t+1},A_{t+1},\mathbf{w}_t) - \bar{Q}_{t+1}\right). \quad □$$

At λ = 1, these algorithms become closely related to corresponding Monte Carlo algorithms. One
might expect that an exact equivalence would hold for episodic problems and off-line updating, but in
fact the relationship is subtler and slightly weaker than that. Even under these most favorable conditions
there is not an episode-by-episode equivalence of updates, only of their expectations. This should not be
surprising, as these methods make irrevocable updates as a trajectory unfolds, whereas true Monte Carlo
methods would make no update for a trajectory if any action within it has zero probability under the
target policy. In particular, all of these methods, even at λ = 1, still bootstrap in the sense that their
targets depend on the current value estimates; it is just that the dependence cancels out in expected
value. Whether this is a good or bad property in practice is another question. Recently, methods have
been proposed that do achieve an exact equivalence (Sutton, Mahmood, Precup and van Hasselt, 2014).
These methods require an additional vector of “provisional weights” that keep track of updates which
have been made but may need to be retracted (or emphasized) depending on the actions taken later.
The state and state-action versions of these methods are called PTD(λ) and PQ(λ) respectively, where
the ‘P’ stands for Provisional.

The practical consequences of all these new off-policy methods have not yet been established.
Undoubtedly, issues of high variance will arise as they do in all off-policy methods using importance
sampling (Section 11.9).

If λ < 1, then all these off-policy algorithms involve bootstrapping and the deadly triad applies
(Section 11.3), meaning that they can be guaranteed stable only for the tabular case, for state aggregation,
and for other limited forms of function approximation. For linear and more-general forms of function
approximation the parameter vector may diverge to infinity as in the examples in Chapter 11. As
we discussed there, the challenge of off-policy learning has two parts. Off-policy eligibility traces deal
effectively with the first part of the challenge, correcting for the expected value of the targets, but not
at all with the second part of the challenge, having to do with the distribution of updates. Algorithmic
strategies for meeting the second part of the challenge of off-policy learning with eligibility traces are
summarized in Section 12.11.


Exercise 12.14 What are the dutch-trace and replacing-trace versions of off-policy eligibility traces 
for state-value and action-value methods? □ 

12.10 Watkins’s Q(λ) to Tree-Backup(λ)


Several methods have been proposed over the years to extend Q-learning to eligibility traces. The
original is Watkins’s Q(λ), which decays its eligibility traces in the usual way as long as a greedy
action was taken, then cuts the traces to zero after the first non-greedy action. The backup diagram
for Watkins’s Q(λ) is shown in Figure 12.12. In Chapter 6, we unified Q-learning and Expected Sarsa
in the off-policy version of the latter, which includes Q-learning as a special case and generalizes it
to arbitrary target policies. In the previous section of this chapter we completed our treatment of
Expected Sarsa by generalizing it to off-policy eligibility traces. In Chapter 7, however, we distinguished
multi-step Expected Sarsa from multi-step Tree Backup, where the latter retained the property of not
using importance sampling. It remains then to present the eligibility trace version of Tree Backup,
which we will call Tree-Backup(λ), or TB(λ) for short. This is arguably the true successor to
Q-learning because it retains its appealing absence of importance sampling even though it can be applied
to off-policy data.

The concept of TB(λ) is straightforward. As shown in its backup diagram in Figure 12.13, the
tree-backup updates of each length (from Section 7.5) are weighted in the usual way dependent on
the bootstrapping parameter λ. To get the detailed equations, with the right indexes on the general
bootstrapping and discounting parameters, it is best to start with a recursive form (12.27) for the
λ-return using action values, and then expand the bootstrapping case of the target after the model of
(7.13):

$$
\begin{aligned}
G_t^{\lambda a} &= R_{t+1} + \gamma_{t+1}\Bigg((1-\lambda_{t+1})\bar{Q}_{t+1} + \lambda_{t+1}\Big(\sum_{a\neq A_{t+1}}\pi(a|S_{t+1})\hat{q}(S_{t+1},a,\mathbf{w}_t) + \pi(A_{t+1}|S_{t+1})G_{t+1}^{\lambda a}\Big)\Bigg) \\
&= R_{t+1} + \gamma_{t+1}\Big(\bar{Q}_{t+1} + \lambda_{t+1}\pi(A_{t+1}|S_{t+1})\left(G_{t+1}^{\lambda a} - \hat{q}(S_{t+1},A_{t+1},\mathbf{w}_t)\right)\Big)
\end{aligned}
$$

Figure 12.12: The backup diagram for Watkins’s Q(λ). The series of component updates ends either with the
end of the episode or with the first nongreedy action, whichever comes first.

Figure 12.13: The backup diagram for the λ version of the Tree Backup algorithm.

As per the usual pattern, it can also be written approximately (ignoring changes in the approximate
value function) as a sum of TD errors,

$$G_t^{\lambda a} \approx \hat{q}(S_t,A_t,\mathbf{w}_t) + \sum_{k=t}^{\infty}\delta_k^a\prod_{i=t+1}^{k}\gamma_i\lambda_i\pi(A_i|S_i), \tag{12.37}$$

using the expectation form of the action-based TD error:

$$\delta_t^a = R_{t+1} + \gamma_{t+1}\bar{Q}_{t+1} - \hat{q}(S_t,A_t,\mathbf{w}_t). \tag{12.38}$$

Following the same steps as in the previous section, we arrive at a special eligibility trace update
involving the target-policy probabilities of the selected actions,

$$\mathbf{z}_t = \gamma_t\lambda_t\pi(A_t|S_t)\mathbf{z}_{t-1} + \nabla\hat{q}(S_t,A_t,\mathbf{w}_t). \tag{12.39}$$

This, together with the usual parameter-update rule (12.7), defines the TB(λ) algorithm. Like all
semi-gradient algorithms, TB(λ) is not guaranteed to be stable when used with off-policy data and with
a powerful function approximator. For that it would have to be combined with one of the methods
presented in the next section.
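
For linear action values, one TB(λ) step can be sketched as below. The function name and arguments are hypothetical, and as an assumption of the sketch the expectation $\bar{Q}_{t+1}$ is computed from a policy-averaged feature vector supplied by the caller. Note that no importance sampling ratio appears anywhere; the behavior policy never enters the computation.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def tree_backup_lambda_step(w, z, x, xbar_next, reward,
                            pi_taken, gamma, lam, gamma_next, alpha):
    """One TB(lambda) step with linear q-hat.

    x = x(S_t, A_t); xbar_next is the average of x(S_{t+1}, a) under the target
    policy; pi_taken is pi(A_t | S_t), the target-policy probability of the
    action actually taken. Returns the updated (w, z).
    """
    q_bar_next = dot(w, xbar_next)
    delta = reward + gamma_next * q_bar_next - dot(w, x)               # (12.38)
    z = [gamma * lam * pi_taken * zi + xi for zi, xi in zip(z, x)]     # (12.39)
    w = [wi + alpha * delta * zi for wi, zi in zip(w, z)]              # (12.7)
    return w, z
```

If the action taken has zero probability under the target policy, `pi_taken` is 0 and the old trace is cut completely, which for a greedy target policy is the same cutting behavior described for Watkins's Q(λ).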

Exercise 12.15 (programming) De Asis (personal communication) has proposed a new algorithm
he calls “Rigid Tree Backup” that combines (12.38) with (12.36) in the usual way (12.7). Implement
this algorithm and compare it empirically with TB(λ) and general Sarsa(λ) on the 19-state random
walk. □


12.11 Stable Off-policy Methods with Traces 


Several methods using eligibility traces have been proposed that achieve guarantees of stability under 
off-policy training, and here we present four of the most important using this book’s standard notation, 
including general bootstrapping and discounting functions. All are based on either the Gradient-TD 
or the Emphatic-TD ideas presented in Sections 11.7 and 11.8. All the algorithms assume linear 
function approximation, though extensions to nonlinear function approximation can also be found in 
the literature. 

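GTD(λ) is the eligibility-trace algorithm analogous to TDC, the better of the two state-value
Gradient-TD prediction algorithms discussed in Section 11.7. Its goal is to learn a parameter $\mathbf{w}_t$
such that $\hat v(s,\mathbf{w}) = \mathbf{w}_t^\top \mathbf{x}(s) \approx v_\pi(s)$ even from data that is due to following another policy $b$. Its
update is

$$
\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \delta_t^s \mathbf{z}_t - \alpha \gamma_{t+1}(1-\lambda_{t+1})\left(\mathbf{z}_t^\top \mathbf{v}_t\right)\mathbf{x}_{t+1}, \tag{12.40}
$$

with $\delta_t^s$, $\mathbf{z}_t$, and $\rho_t$ defined in the usual ways for state values (12.30), (12.32), (11.1), and

$$
\mathbf{v}_{t+1} = \mathbf{v}_t + \beta \delta_t^s \mathbf{z}_t - \beta\left(\mathbf{v}_t^\top \mathbf{x}_t\right)\mathbf{x}_t, \tag{12.41}
$$

where, as in Section 11.7, $\mathbf{v} \in \mathbb{R}^d$ is a vector of the same dimension as $\mathbf{w}$, initialized to $\mathbf{v}_0 = \mathbf{0}$, and
$\beta > 0$ is a second step-size parameter.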
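
The two coupled updates (12.40) and (12.41) can be sketched in a few lines. This is an illustrative sketch only: it assumes the state-value TD error has the form δ = R + γ wᵀx' − wᵀx and that the trace is z_t = ρ_t(γ_t λ_t z_{t−1} + x_t), which is how the definitions referred to above are read here. The function name `gtd_lambda_step` and its arguments are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def gtd_lambda_step(w, v, z_prev, x_t, x_next, reward, rho_t,
                    alpha, beta, gamma_t, lam_t, gamma_next, lam_next):
    """One GTD(lambda) update with linear state values v(s, w) = w . x(s)."""
    # Assumed off-policy accumulating trace for state values.
    z = [rho_t * (gamma_t * lam_t * zp + xi) for zp, xi in zip(z_prev, x_t)]
    # Assumed state-value TD error.
    delta = reward + gamma_next * dot(w, x_next) - dot(w, x_t)
    # Primary weights, equation (12.40): a TD(lambda)-like step plus a
    # gradient-correction term along the next feature vector.
    zv = dot(z, v)
    correction = alpha * gamma_next * (1.0 - lam_next) * zv
    w_new = [wi + alpha * delta * zi - correction * xn
             for wi, zi, xn in zip(w, z, x_next)]
    # Secondary weights, equation (12.41).
    vx = dot(v, x_t)
    v_new = [vi + beta * delta * zi - beta * vx * xi
             for vi, zi, xi in zip(v, z, x_t)]
    return w_new, v_new, z
```

On the very first step the secondary weights are zero, so the correction term vanishes and the primary update coincides with an ordinary TD(λ) step; the correction only becomes active once v has moved away from zero.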

GQ(λ) is the Gradient-TD algorithm for action values with eligibility traces. Its goal is to learn a
parameter $\mathbf{w}_t$ such that $\hat q(s,a,\mathbf{w}_t) = \mathbf{w}_t^\top \mathbf{x}(s,a) \approx q_\pi(s,a)$ from off-policy data. If the target policy
is ε-greedy, or otherwise biased toward the greedy policy for $\hat q$, then GQ(λ) can be used as a control
algorithm. Its update is

$$
\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \delta_t^a \mathbf{z}_t - \alpha \gamma_{t+1}(1-\lambda_{t+1})\left(\mathbf{z}_t^\top \mathbf{v}_t\right)\bar{\mathbf{x}}_{t+1}, \tag{12.42}
$$

where $\bar{\mathbf{x}}_t$ is the average feature vector for $S_t$ under the target policy,

$$
\bar{\mathbf{x}}_t = \sum_a \pi(a|S_t)\,\mathbf{x}(S_t,a), \tag{12.43}
$$

$\delta_t^a$ is the expectation form of the TD error, which can be written

$$
\delta_t^a = R_{t+1} + \gamma_{t+1}\mathbf{w}_t^\top \bar{\mathbf{x}}_{t+1} - \mathbf{w}_t^\top \mathbf{x}_t, \tag{12.44}
$$

$\mathbf{z}_t$ is defined in the usual way for action values (12.36), and the rest is as in GTD(λ), including the
update for $\mathbf{v}_t$ (12.41).

HTD(λ) is a hybrid state-value algorithm combining aspects of GTD(λ) and TD(λ). Its most appealing
feature is that it is a strict generalization of TD(λ) to off-policy learning, meaning that if the
behavior policy happens to be the same as the target policy, then HTD(λ) becomes the same as TD(λ),
which is not true for GTD(λ). This is appealing because TD(λ) is often faster than GTD(λ) when both
algorithms converge, and TD(λ) requires setting only a single step size. HTD(λ) is defined by

$$
\begin{aligned}
\mathbf{w}_{t+1} &= \mathbf{w}_t + \alpha \delta_t^s \mathbf{z}_t + \alpha\left((\mathbf{z}_t - \mathbf{z}_t^b)^\top \mathbf{v}_t\right)(\mathbf{x}_t - \gamma_{t+1}\mathbf{x}_{t+1}), \\
\mathbf{v}_{t+1} &= \mathbf{v}_t + \beta \delta_t^s \mathbf{z}_t - \beta\left({\mathbf{z}_t^b}^\top \mathbf{v}_t\right)(\mathbf{x}_t - \gamma_{t+1}\mathbf{x}_{t+1}), && \text{with } \mathbf{v}_0 = \mathbf{0}, \\
\mathbf{z}_t &= \rho_t\left(\gamma_t \lambda_t \mathbf{z}_{t-1} + \mathbf{x}_t\right), && \text{with } \mathbf{z}_{-1} = \mathbf{0}, \\
\mathbf{z}_t^b &= \gamma_t \lambda_t \mathbf{z}_{t-1}^b + \mathbf{x}_t, && \text{with } \mathbf{z}_{-1}^b = \mathbf{0},
\end{aligned}
\tag{12.45}
$$

where $\beta > 0$ again is a second step-size parameter that becomes irrelevant in the on-policy case in which
$b = \pi$. In addition to the second set of weights, $\mathbf{v}_t$, HTD(λ) also has a second set of eligibility traces,
$\mathbf{z}_t^b$. These are a conventional accumulating eligibility trace for the behavior policy and become equal to
$\mathbf{z}_t$ if all the $\rho_t$ are 1, which causes the second term in the $\mathbf{w}_t$ update to be zero and the overall update
to reduce to TD(λ).
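
The reduction to TD(λ) in the on-policy case can be checked numerically. The sketch below implements one step of (12.45) for linear function approximation; the function name `htd_lambda_step` and its arguments are hypothetical, and the TD error is assumed to be the usual state-value form δ = R + γ wᵀx' − wᵀx.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def htd_lambda_step(w, v, z_prev, zb_prev, x_t, x_next, reward, rho_t,
                    alpha, beta, gamma_t, lam_t, gamma_next):
    """One HTD(lambda) update following the four equations of (12.45)."""
    # Importance-weighted trace z and conventional behavior-policy trace z^b.
    z = [rho_t * (gamma_t * lam_t * zp + xi) for zp, xi in zip(z_prev, x_t)]
    zb = [gamma_t * lam_t * zbp + xi for zbp, xi in zip(zb_prev, x_t)]
    # Assumed state-value TD error.
    delta = reward + gamma_next * dot(w, x_next) - dot(w, x_t)
    # Shared direction x_t - gamma_{t+1} x_{t+1}.
    diff = [xi - gamma_next * xn for xi, xn in zip(x_t, x_next)]
    corr_w = dot([zi - zbi for zi, zbi in zip(z, zb)], v)
    corr_v = dot(zb, v)
    w_new = [wi + alpha * delta * zi + alpha * corr_w * di
             for wi, zi, di in zip(w, z, diff)]
    v_new = [vi + beta * delta * zi - beta * corr_v * di
             for vi, zi, di in zip(v, z, diff)]
    return w_new, v_new, z, zb
```

When every ρ_t is 1 the two traces stay identical, so `corr_w` is zero regardless of the secondary weights, and the primary weights follow exactly the trajectory that TD(λ) would produce.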

Emphatic TD(λ) is the extension of the one-step Emphatic-TD algorithm from Section 11.8 to eligibility
traces. The resultant algorithm retains strong off-policy convergence guarantees while enabling any
degree of bootstrapping, albeit at the cost of high variance and potentially slow convergence. Emphatic
TD(λ) is defined by

$$
\begin{aligned}
\boldsymbol{\theta}_{t+1} &= \boldsymbol{\theta}_t + \alpha \delta_t \mathbf{z}_t \\
\delta_t &= R_{t+1} + \gamma_{t+1}\boldsymbol{\theta}_t^\top \mathbf{x}_{t+1} - \boldsymbol{\theta}_t^\top \mathbf{x}_t \\
\mathbf{z}_t &= \rho_t\left(\gamma_t \lambda_t \mathbf{z}_{t-1} + M_t \mathbf{x}_t\right), && \text{with } \mathbf{z}_{-1} = \mathbf{0} \\
M_t &= \lambda_t I_t + (1-\lambda_t) F_t \\
F_t &= \rho_{t-1}\gamma_t F_{t-1} + I_t, && \text{with } F_0 = i(S_0),
\end{aligned}
$$

where $M_t \geq 0$ is the general form of emphasis, $F_t \geq 0$ is termed the followon trace, and $I_t \geq 0$ is
the interest, as described in Section 11.8. Note that $M_t$, like $\delta_t$, is not really an additional memory
variable. It can be removed from the algorithm by substituting its definition into the eligibility-trace
equation. Pseudocode and software for the true online version of Emphatic-TD(λ) are available on the
web (Sutton, 2015b).
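
The five equations above translate almost line for line into code. The following is a minimal sketch for linear function approximation, not the true online version mentioned above; the function name `emphatic_td_lambda_step` and its arguments are hypothetical. Passing `F_prev = 0` on the first step gives F_0 = I_0, which matches the initialization F_0 = i(S_0).

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def emphatic_td_lambda_step(theta, z_prev, F_prev, rho_prev, x_t, x_next,
                            reward, rho_t, interest_t, alpha,
                            gamma_t, lam_t, gamma_next):
    """One Emphatic TD(lambda) update with linear state values."""
    # Followon trace: discounted, importance-weighted carry-over plus interest.
    F = rho_prev * gamma_t * F_prev + interest_t
    # Emphasis: mixes current interest with the followon trace.
    M = lam_t * interest_t + (1.0 - lam_t) * F
    # Eligibility trace, with the current features scaled by the emphasis.
    z = [rho_t * (gamma_t * lam_t * zp + M * xi)
         for zp, xi in zip(z_prev, x_t)]
    # TD error and parameter update.
    delta = reward + gamma_next * dot(theta, x_next) - dot(theta, x_t)
    theta_new = [th + alpha * delta * zi for th, zi in zip(theta, z)]
    return theta_new, z, F
```

Only the scalar F and the vector z need to be carried between steps; M is recomputed from them each time, which is the sense in which it is not an additional memory variable.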

In the on-policy case ($\rho_t = 1$, for all $t$), Emphatic-TD(λ) is similar to conventional TD(λ), but
still significantly different. In fact, whereas Emphatic-TD(λ) is guaranteed to converge for all
state-dependent λ functions, TD(λ) is not. TD(λ) is guaranteed convergent only for all constant λ. See Yu’s
counterexample (Ghiassian, Rafiee, and Sutton, 2016).


12.12 Implementation Issues 

It might at first appear that methods using eligibility traces are much more complex than one-step
methods. A naive implementation would require every state (or state-action pair) to update both its
value estimate and its eligibility trace on every time step. This would not be a problem for
implementations on single-instruction, multiple-data, parallel computers or in plausible neural implementations,
but it is a problem for implementations on conventional serial computers. Fortunately, for typical values
of λ and γ the eligibility traces of almost all states are almost always nearly zero; only those that have
recently been visited will have traces significantly greater than zero. In practice, only these few states
need to be updated to closely approximate these algorithms.

In practice, then, implementations on conventional computers may keep track of and update only
the few states with nonzero traces. Using this trick, the computational expense of using traces is
typically just a few times that of a one-step method. The exact multiple of course depends on λ
and γ and on the expense of the other computations. Note that the tabular case is in some sense
the worst case for the computational complexity of eligibility traces. When function approximation is
used, the computational advantages of not using traces generally decrease. For example, if artificial
neural networks and backpropagation are used, then eligibility traces generally cause only a doubling
of the required memory and computation per step. Truncated λ-return methods (Section 12.3) can
be computationally efficient on conventional computers though they always require some additional
memory.
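
The trick of tracking only the states with non-negligible traces might look as follows for tabular TD(λ) with accumulating traces. This is a sketch under assumptions: the dictionary representation, the pruning `threshold`, and the function name `sparse_td_lambda_step` are choices made for this illustration, and dropping small traces makes the result an approximation to exact TD(λ).

```python
def sparse_td_lambda_step(values, traces, state, reward, next_state,
                          alpha, gamma, lam, threshold=1e-4):
    """One tabular TD(lambda) step that touches only states with live traces.

    values -- dict mapping state to its value estimate (missing means 0)
    traces -- dict holding only the traces that are at least `threshold`
    """
    # TD error, computed from the estimates before this step's update.
    delta = reward + gamma * values.get(next_state, 0.0) - values.get(state, 0.0)
    # Decay the live traces and drop those that have become negligible.
    for s in list(traces):
        traces[s] *= gamma * lam
        if traces[s] < threshold:
            del traces[s]
    # Accumulating trace for the state just visited.
    traces[state] = traces.get(state, 0.0) + 1.0
    # Update only the states that still carry a trace.
    for s, e in traces.items():
        values[s] = values.get(s, 0.0) + alpha * delta * e
    return delta
```

The cost per step is proportional to the number of live traces rather than to the size of the state space; with decay factor γλ per step, a trace that started at 1 falls below the threshold after roughly log(threshold) / log(γλ) steps.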

12.13 Conclusions

Eligibility traces in conjunction with TD errors provide an efficient, incremental way of shifting and
choosing between Monte Carlo and TD methods. The atomic multi-step methods of Chapter 7 also
enabled this, but eligibility trace methods are more general, often faster to learn, and offer different
computational complexity tradeoffs. This chapter has offered an introduction to the elegant, emerging
theoretical understanding of eligibility traces for on- and off-policy learning and for variable
bootstrapping and discounting. One aspect of this elegant theory is true online methods, which exactly reproduce
the behavior of expensive ideal methods while retaining the computational congeniality of conventional
TD methods. Another aspect is the possibility of derivations that automatically convert from intuitive
forward-view methods to more efficient incremental backward-view algorithms. We illustrated this
general idea in a derivation that started with a classical, expensive Monte Carlo algorithm and ended with
a cheap incremental non-TD implementation using the same novel eligibility trace used in true online
TD methods.

As we mentioned in Chapter 5, Monte Carlo methods may have advantages in non-Markov tasks 
because they do not bootstrap. Because eligibility traces make TD methods more like Monte Carlo 
methods, they also can have advantages in these cases. If one wants to use TD methods because of 
their other advantages, but the task is at least partially non-Markov, then the use of an eligibility trace 
method is indicated. Eligibility traces are the first line of defense against both long-delayed rewards 
and non-Markov tasks. 

By adjusting λ, we can place eligibility trace methods anywhere along a continuum from Monte
Carlo to one-step TD methods. Where shall we place them? We do not yet have a good theoretical
answer to this question, but a clear empirical answer appears to be emerging. On tasks with many
steps per episode, or many steps within the half-life of discounting, it appears significantly better to use
eligibility traces than not to (e.g., see Figure 12.14). On the other hand, if the traces are so long as to
produce a pure Monte Carlo method, or nearly so, then performance degrades sharply. An intermediate
mixture appears to be the best choice. Eligibility traces should be used to bring us toward Monte Carlo
methods, but not all the way there. In the future it may be possible to vary the trade-off between TD
and Monte Carlo methods more finely by using variable λ, but at present it is not clear how this can
be done reliably and usefully.

Methods using eligibility traces require more computation than one-step methods, but in return they 
offer significantly faster learning, particularly when rewards are delayed by many steps. Thus it often 
makes sense to use eligibility traces when data are scarce and cannot be repeatedly processed, as is 
often the case in on-line applications. On the other hand, in off-line applications in which data can be 
generated cheaply, perhaps from an inexpensive simulation, then it often does not pay to use eligibility 
traces. In these cases the objective is not to get more out of a limited amount of data, but simply to 
process as much data as possible as quickly as possible. In these cases the speedup per datum due to 
traces is typically not worth their computational cost, and one-step methods are favored. 

*Exercise 12.16 How might Double Expected Sarsa be extended to eligibility traces? □ 

Figure 12.14: The effect of λ on reinforcement learning performance in four different test problems. In all cases,
lower numbers represent better performance. The two left panels are applications to simple continuous-state
control tasks using the Sarsa(λ) algorithm and tile coding, with either replacing or accumulating traces (Sutton,
1996). The upper-right panel is for policy evaluation on a random walk task using TD(λ) (Singh and Sutton,
1996). The lower right panel is unpublished data for the pole-balancing task (Example 3.4) from an earlier
study (Sutton, 1984).


Bibliographical and Historical Remarks 

Eligibility traces came into reinforcement learning via the fecund ideas of Klopf (1972). Our use of 
eligibility traces is based on Klopf’s work (Sutton, 1978a, 1978b, 1978c; Barto and Sutton, 1981a, 
1981b; Sutton and Barto, 1981a; Barto, Sutton, and Anderson, 1983; Sutton, 1984). We may have been 
the first to use the term “eligibility trace” (Sutton and Barto, 1981). The idea that stimuli produce 
aftereffects in the nervous system that are important for learning is very old. See Chapter 14. Some of 
the earliest uses of eligibility traces were in the actor-critic methods discussed in Chapter 13 (Barto, 
Sutton, and Anderson, 1983; Sutton, 1984). 

12.1 The λ-return and its error-reduction properties were introduced by Watkins (1989) and further
developed by Jaakkola, Jordan and Singh (1994). The random walk results in this and
subsequent sections are new to this text, as are the terms “forward view” and “backward view.”
The notion of a λ-return algorithm was introduced in the first edition of this text. The more
refined treatment presented here was developed in conjunction with Harm van Seijen (e.g., van
Seijen and Sutton, 2014).

12.2 TD(λ) with accumulating traces was introduced by Sutton (1988, 1984). Convergence in the
mean was proved by Dayan (1992), and with probability 1 by many researchers, including
Peng (1993), Dayan and Sejnowski (1994), Tsitsiklis (1994), and Gurvits, Lin, and Hanson
(1994). The bound on the error of the asymptotic λ-dependent solution of linear TD(λ) is due
to Tsitsiklis and Van Roy (1997).

12.3-5 Truncated TD methods were developed by Cichosz (1995) and van Seijen (2016). True online
TD(λ) and the other ideas presented in these sections are primarily due to work of van Seijen
(van Seijen and Sutton, 2014; van Seijen et al., 2016). Replacing traces are due to Singh and
Sutton (1996).

12.6 The material in this section is from van Hasselt and Sutton (2015).

12.7 Sarsa(λ) with accumulating traces was first explored as a control method by Rummery and
Niranjan (1994; Rummery, 1995). True Online Sarsa(λ) was introduced by van Seijen and
Sutton (2014). The algorithm on page 252 was adapted from van Seijen et al. (2016). The
Mountain Car results were made new for this text, except for Figure 12.11 which is adapted
from van Seijen and Sutton (2014).

12.8 Perhaps the first published discussion of variable λ was by Watkins (1989), who pointed out
that the cutting off of the update sequence (Figure 12.12) in his Q(λ) when a nongreedy action
was selected could be implemented by temporarily setting λ to 0.

Variable λ was introduced in the first edition of this text. The roots of variable γ are in
the work on options (Sutton, Precup, and Singh, 1999) and its precursors (Sutton, 1995a),
becoming explicit in the GQ(λ) paper (Maei and Sutton, 2010), which also introduced some of
these recursive forms for the λ-returns.

A different notion of variable λ has been developed by Yu (2012).

12.9 Off-policy eligibility traces were introduced by Precup et al. (2000, 2001), then further developed
by Bertsekas and Yu (2009), Maei (2011; Maei and Sutton, 2010), Yu (2012), and by Sutton,
Mahmood, Precup, and van Hasselt (2014). The latter reference in particular gives a powerful
forward view for off-policy TD methods with general state-dependent λ and γ. The presentation
here seems to be new.

12.10 Watkins’s Q(λ) is due to Watkins (1989). Convergence has still not been proved for any control
method for 0 < λ < 1. Tree Backup(λ) is due to Precup, Sutton, and Singh (2000).

12.11 GTD(λ) is due to Maei (2011). GQ(λ) is due to Maei and Sutton (2010). HTD(λ) is due
to White and White (2016) based on the one-step HTD algorithm introduced by Hackman
(2012). Emphatic TD(λ) was introduced by Sutton, Mahmood, and White (2016), who proved
its stability, then was proved to be convergent by Yu (2015a,b), and developed further by
Hallak, Tamar, Munos, and Mannor (2016).

Chapter 13


Policy Gradient Methods 


In this chapter we consider something new. So far in this book almost all the methods have learned the
values of actions and then selected actions based on their estimated action values¹; their policies would
not even exist without the action-value estimates. In this chapter we consider methods that instead
learn a parameterized policy that can select actions without consulting a value function. A value function
may still be used to learn the policy parameter, but is not required for action selection. We use the
notation $\boldsymbol{\theta} \in \mathbb{R}^{d'}$ for the policy’s parameter vector. Thus we write $\pi(a|s,\boldsymbol{\theta}) = \Pr\{A_t = a \mid S_t = s, \boldsymbol{\theta}_t = \boldsymbol{\theta}\}$
for the probability that action $a$ is taken at time $t$ given that the environment is in state $s$ at time $t$
with parameter $\boldsymbol{\theta}$. If a method uses a learned value function as well, then the value function’s weight
vector is denoted $\mathbf{w} \in \mathbb{R}^{d}$ as usual, as in $\hat v(s,\mathbf{w})$.

In this chapter we consider methods for learning the policy parameter based on the gradient of some
performance measure $J(\boldsymbol{\theta})$ with respect to the policy parameter. These methods seek to maximize
performance, so their updates approximate gradient ascent in $J$:

$$
\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha \widehat{\nabla J(\boldsymbol{\theta}_t)}, \tag{13.1}
$$

where $\widehat{\nabla J(\boldsymbol{\theta}_t)}$ is a stochastic estimate whose expectation approximates the gradient of the performance
measure with respect to its argument $\boldsymbol{\theta}_t$. All methods that follow this general schema we call policy
gradient methods, whether or not they also learn an approximate value function. Methods that learn
approximations to both policy and value functions are often called actor-critic methods, where ‘actor’
is a reference to the learned policy, and ‘critic’ refers to the learned value function, usually a state-value
function. First we treat the episodic case, in which performance is defined as the value of the start state
under the parameterized policy, before going on to consider the continuing case, in which performance is
defined as the average reward rate, as in Section 10.3. In the end we are able to express the algorithms
for both cases in very similar terms.


13.1 Policy Approximation and its Advantages 

In policy gradient methods, the policy can be parameterized in any way, as long as $\pi(a|s,\boldsymbol{\theta})$ is
differentiable with respect to its parameters, that is, as long as $\nabla_{\boldsymbol{\theta}}\pi(a|s,\boldsymbol{\theta})$ exists and is always finite. In
practice, to ensure exploration we generally require that the policy never becomes deterministic (i.e.,
that $\pi(a|s,\boldsymbol{\theta}) \in (0,1)$, for all $s, a, \boldsymbol{\theta}$). In this section we introduce the most common parameterization
for discrete action spaces and point out the advantages it offers over action-value methods.
Policy-based methods also offer useful ways of dealing with continuous action spaces, as we describe later in
Section 13.7.

¹ The lone exception is the gradient bandit algorithms of Section 2.8. In fact, that section goes through many of the
same steps, in the single-state bandit case, as we go through here for full MDPs. Reviewing that section would be good
preparation for fully understanding this chapter.


If the action space is discrete and not too large, then a natural kind of parameterization is to form
parameterized numerical preferences $h(s,a,\boldsymbol{\theta}) \in \mathbb{R}$ for each state-action pair. The actions with the
highest preferences in each state are given the highest probabilities of being selected, for example,
according to an exponential softmax distribution:

$$
\pi(a|s,\boldsymbol{\theta}) = \frac{\exp\big(h(s,a,\boldsymbol{\theta})\big)}{\sum_b \exp\big(h(s,b,\boldsymbol{\theta})\big)}, \tag{13.2}
$$

where $\exp(x) = e^x$, and $e \approx 2.71828$ is the base of the natural logarithm. Note that the denominator
here is just what is required so that the action probabilities in each state sum to one. The preferences
themselves can be parameterized arbitrarily. For example, they might be computed by a deep neural
network, where $\boldsymbol{\theta}$ is the vector of all the connection weights of the network (as in the AlphaGo system
described in Section 16.6). Or the preferences could simply be linear in features,

$$
h(s,a,\boldsymbol{\theta}) = \boldsymbol{\theta}^\top \mathbf{x}(s,a), \tag{13.3}
$$

using feature vectors $\mathbf{x}(s,a) \in \mathbb{R}^{d'}$ constructed by any of the methods described in Chapter 9.
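
A softmax policy over linear action preferences, (13.2) with (13.3), can be computed as in the sketch below. The function name `softmax_policy` is hypothetical. Subtracting the largest preference before exponentiating is a standard numerical-stability step added here; it does not change the probabilities, because the common factor cancels between numerator and denominator.

```python
import math


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def softmax_policy(theta, features):
    """Action probabilities for one state.

    theta    -- policy parameter vector
    features -- list of feature vectors x(s, a), one per action a
    """
    # Linear action preferences h(s, a, theta) = theta . x(s, a), as in (13.3).
    prefs = [dot(theta, x) for x in features]
    # Shift by the maximum preference so that exp never overflows.
    m = max(prefs)
    exps = [math.exp(h - m) for h in prefs]
    total = sum(exps)
    # Normalize so the probabilities sum to one, as in (13.2).
    return [e / total for e in exps]
```

With equal preferences every action is equally likely, and as one preference grows without bound the policy approaches a deterministic one while each probability remains nonnegative and the total remains one.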

An immediate advantage of selecting actions according to the softmax in action preferences (13.2) is
that the approximate policy can approach a deterministic policy, whereas with ε-greedy action selection
over action values there is always an ε probability of selecting a random action. Of course, one could
select according to a softmax over action values, but this alone would not allow the policy to approach
a deterministic policy. Instead, the action-value estimates would converge to their corresponding true
values, which would differ by a finite amount, translating to specific probabilities other than 0 and 1.
If the softmax included a temperature parameter, then the temperature could be reduced over time
to approach determinism, but in practice it would be difficult to choose the reduction schedule, or
even the initial temperature, without more prior knowledge of the true action values than we would
like to assume. Action preferences are different because they do not approach specific values; instead
they are driven to produce the optimal stochastic policy. If the optimal policy is deterministic, then
the preferences of the optimal actions will be driven infinitely higher than all suboptimal actions (if
permitted by the parameterization).

In problems with significant function approximation, the best approximate policy may be stochastic. 
For example, in card games with imperfect information the optimal play is often to do two different 
things with specific probabilities, such as when bluffing in Poker. Action-value methods have no natural 
way of finding stochastic optimal policies, whereas policy approximating methods can, as shown in 
Example 13.1. This is a second significant advantage of policy-based methods. 


Exercise 13.1 Use your knowledge of the gridworld and its dynamics to determine an exact symbolic 
expression for the optimal probability of selecting the right action in Example 13.1. 

Perhaps the simplest advantage that policy parameterization may have over action-value
parameterization is that the policy may be a simpler function to approximate. Problems vary in the complexity
of their policies and action-value functions. For some, the action-value function is simpler and thus
easier to approximate. For others, the policy is simpler. In the latter case a policy-based method will
typically learn faster and yield a superior asymptotic policy (as seems to be the case with Tetris; see
Şimşek, Algorta, and Kothiyal, 2016).

Finally, we note that the choice of policy parameterization is sometimes a good way of injecting prior 
knowledge about the desired form of the policy into the reinforcement learning system. This is often 
the most important reason for using a policy-based learning method. 

Example 13.1 Short corridor with switched actions

Consider the small corridor gridworld shown inset in the graph below. The reward is −1 per
step, as usual. In each of the three nonterminal states there are only two actions, right and left.
These actions have their usual consequences in the first and third states (left causes no movement
in the first state), but in the second state they are reversed, so that right moves to the left and
left moves to the right. The problem is difficult because all the states appear identical under the
function approximation. In particular, we define $\mathbf{x}(s,\text{right}) = [1,0]^\top$ and $\mathbf{x}(s,\text{left}) = [0,1]^\top$, for
all $s$. An action-value method with ε-greedy action selection is forced to choose between just
two policies: choosing right with high probability 1 − ε/2 on all steps or choosing left with the
same high probability on all time steps. If ε = 0.1, then these two policies achieve a value (at
the start state) of less than −44 and −82, respectively, as shown in the graph. A method can
do significantly better if it can learn a specific probability with which to select right. The best
probability is about 0.59, which achieves a value of about −11.6.
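
The values quoted in this example can be reproduced numerically. The sketch below evaluates the start state of the corridor for a fixed probability of selecting right by iterating the Bellman equations of the three nonterminal states until they stop changing. The function name `start_state_value`, the tolerance, and the sweep limit are assumptions made for this illustration.

```python
def start_state_value(p_right, tol=1e-12, max_sweeps=1_000_000):
    """Value of the first corridor state when right is chosen w.p. p_right.

    States 0, 1, 2 are nonterminal and the terminal state has value 0.
    The reward is -1 per step and there is no discounting.
    """
    p_left = 1.0 - p_right
    v = [0.0, 0.0, 0.0]
    for _ in range(max_sweeps):
        old = v[:]
        # First state: right moves on, left causes no movement.
        v[0] = -1.0 + p_right * v[1] + p_left * v[0]
        # Second state: actions are reversed, so right moves back.
        v[1] = -1.0 + p_right * v[0] + p_left * v[2]
        # Third state: right reaches the terminal state, left moves back.
        v[2] = -1.0 + p_left * v[1]
        if max(abs(a - b) for a, b in zip(v, old)) < tol:
            break
    return v[0]


# Coarse search over probabilities 0.05, 0.06, ..., 0.95 for the best one.
best_value, best_p = max((start_state_value(k / 100), k / 100)
                         for k in range(5, 96))
```

With ε = 0.1 the two ε-greedy policies correspond to `start_state_value(0.95)` and `start_state_value(0.05)`, and the grid search recovers a best probability close to the 0.59 stated above.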

13.2 The Policy Gradient Theorem

In addition to the practical advantages of policy parameterization over ε-greedy action selection, there
is also an important theoretical advantage. With continuous policy parameterization the action
probabilities change smoothly as a function of the learned parameter, whereas in ε-greedy selection the action
probabilities may change dramatically for an arbitrarily small change in the estimated action values,
if that change results in a different action having the maximal value. Largely because of this, stronger
convergence guarantees are available for policy-gradient methods than for action-value methods. In
particular, it is the continuity of the policy dependence on the parameters that enables policy-gradient
methods to approximate gradient ascent (13.1).

The episodic and continuing cases define the performance measure, $J(\boldsymbol{\theta})$, differently and thus have to
be treated separately to some extent. Nevertheless, we will try to present both cases uniformly, and we
develop a notation so that the major theoretical results can be described with a single set of equations.

In this section we treat the episodic case, for which we define the performance measure as the value
of the start state of the episode. We can simplify the notation without losing any meaningful generality
by assuming that every episode starts in some particular (non-random) state $s_0$. Then, in the episodic
case we define performance as

$$
J(\boldsymbol{\theta}) = v_{\pi_{\boldsymbol{\theta}}}(s_0), \tag{13.4}
$$

where $v_{\pi_{\boldsymbol{\theta}}}$ is the true value function for $\pi_{\boldsymbol{\theta}}$, the policy determined by $\boldsymbol{\theta}$. From here on in our discussion
we will assume no discounting ($\gamma = 1$) for the episodic case, although for completeness we do include
the possibility of discounting in the boxed algorithms.

With function approximation, it may seem challenging to change the policy parameter in a way that 
ensures improvement. The problem is that performance depends on both the action selections and the 
distribution of states in which those selections are made, and that both of these are affected by the 
policy parameter. Given a state, the effect of the policy parameter on the actions, and thus on reward, 
can be computed in a relatively straightforward way from knowledge of the parameterization. But the 
effect of the policy on the state distribution is a function of the environment and is typically unknown. 
How can we estimate the performance gradient with respect to the policy parameter when the gradient 
depends on the unknown effect of policy changes on the state distribution? 

Fortunately, there is an excellent theoretical answer to this challenge in the form of the policy gradient
theorem, which provides us an analytic expression for the gradient of performance with respect to the
policy parameter (which is what we need to approximate for gradient ascent (13.1)) that does not
involve the derivative of the state distribution. The policy gradient theorem establishes that

$$
\nabla J(\boldsymbol{\theta}) \propto \sum_s \mu(s) \sum_a q_\pi(s,a)\,\nabla_{\boldsymbol{\theta}}\pi(a|s,\boldsymbol{\theta}), \tag{13.5}
$$

where the gradients are column vectors of partial derivatives with respect to the components of $\boldsymbol{\theta}$, and
$\pi$ denotes the policy corresponding to parameter vector $\boldsymbol{\theta}$. The symbol $\propto$ here means “proportional
to”. In the episodic case, the constant of proportionality is the average length of an episode, and in the
continuing case it is 1, so that the relationship is actually an equality. The distribution $\mu$ here (as in
Chapters 9 and 10) is the on-policy distribution under $\pi$ (see page 163). The policy gradient theorem
is proved for the episodic case in the box on the next page.

Proof of the Policy Gradient Theorem (episodic case)

With just elementary calculus and re-arranging terms we can prove the policy gradient theorem
from first principles. To keep the notation simple, we leave it implicit in all cases that $\pi$ is a
function of $\boldsymbol{\theta}$, and all gradients are also implicitly with respect to $\boldsymbol{\theta}$. First note that the gradient
of the state-value function can be written in terms of the action-value function as

$$
\begin{aligned}
\nabla v_\pi(s) &= \nabla\Big[\sum_a \pi(a|s)\,q_\pi(s,a)\Big], \quad \text{for all } s \in \mathcal{S} && \text{(Exercise 3.15)} \\
&= \sum_a \Big[\nabla\pi(a|s)\,q_\pi(s,a) + \pi(a|s)\,\nabla q_\pi(s,a)\Big] && \text{(product rule)} \\
&= \sum_a \Big[\nabla\pi(a|s)\,q_\pi(s,a) + \pi(a|s)\,\nabla\sum_{s',r} p(s',r|s,a)\big(r + v_\pi(s')\big)\Big] && \text{(Exercise 3.16 and Equation 3.2)} \\
&= \sum_a \Big[\nabla\pi(a|s)\,q_\pi(s,a) + \pi(a|s)\sum_{s'} p(s'|s,a)\,\nabla v_\pi(s')\Big] && \text{(Eq. 3.4)} \\
&= \sum_a \Big[\nabla\pi(a|s)\,q_\pi(s,a) + \pi(a|s)\sum_{s'} p(s'|s,a) && \text{(unrolling)} \\
&\qquad\quad \sum_{a'} \big[\nabla\pi(a'|s')\,q_\pi(s',a') + \pi(a'|s')\sum_{s''} p(s''|s',a')\,\nabla v_\pi(s'')\big]\Big] \\
&= \sum_{x \in \mathcal{S}} \sum_{k=0}^{\infty} \Pr(s \to x, k, \pi) \sum_a \nabla\pi(a|x)\,q_\pi(x,a),
\end{aligned}
$$

after repeated unrolling, where $\Pr(s \to x, k, \pi)$ is the probability of transitioning from state $s$ to
state $x$ in $k$ steps under policy $\pi$. It is then immediate that

$$
\begin{aligned}
\nabla J(\boldsymbol{\theta}) &= \nabla v_\pi(s_0) \\
&= \sum_s \Big(\sum_{k=0}^{\infty} \Pr(s_0 \to s, k, \pi)\Big) \sum_a \nabla\pi(a|s)\,q_\pi(s,a) \\
&= \sum_s \eta(s) \sum_a \nabla\pi(a|s)\,q_\pi(s,a) && \text{(box page 163)} \\
&= \Big(\sum_{s'} \eta(s')\Big) \sum_s \frac{\eta(s)}{\sum_{s'} \eta(s')} \sum_a \nabla\pi(a|s)\,q_\pi(s,a) \\
&\propto \sum_s \mu(s) \sum_a \nabla\pi(a|s)\,q_\pi(s,a). && \text{(Eq. 9.3)} \quad \text{Q.E.D.}
\end{aligned}
$$

13.3 REINFORCE: Monte Carlo Policy Gradient


We are now ready for our first policy-gradient learning algorithm. Recall our overall strategy of
stochastic gradient ascent (13.1), which requires a way to obtain samples such that the expectation of the
sample gradient is proportional to the actual gradient of the performance measure as a function of the
parameter. The sample gradients need only be proportional to the gradient because any constant of
proportionality can be absorbed into the step size α, which is otherwise arbitrary. The policy gradient
theorem gives an exact expression proportional to the gradient; all that is needed is some way of
sampling whose expectation equals or approximates this expression. Notice that the right-hand side of the
policy gradient theorem is a sum over states weighted by how often the states occur under the target
policy $\pi$; if $\pi$ is followed, then states will be encountered in these proportions. Thus

$$
\begin{aligned}
\nabla J(\boldsymbol{\theta}) &\propto \sum_s \mu(s) \sum_a q_\pi(s,a)\,\nabla_{\boldsymbol{\theta}}\pi(a|s,\boldsymbol{\theta}) && (13.5) \\
&= \mathbb{E}_\pi\Big[\sum_a q_\pi(S_t,a)\,\nabla_{\boldsymbol{\theta}}\pi(a|S_t,\boldsymbol{\theta})\Big].
\end{aligned}
$$


This is good progress, and we would like to carry it further and handle the action in the same way
(replacing $a$ with the sample action $A_t$). The remaining part of the expectation above is a sum over
actions; if only each term were weighted by the probability of selecting the actions, that is, according
to $\pi(a|S_t,\boldsymbol{\theta})$, then the replacement could be done. We can arrange for this by multiplying and dividing
by this probability. Continuing from the previous equation, this gives


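$$
\begin{aligned}
\nabla J(\boldsymbol{\theta}) &\propto \mathbb{E}_\pi\Big[\sum_a \pi(a|S_t,\boldsymbol{\theta})\,q_\pi(S_t,a)\,\frac{\nabla_{\boldsymbol{\theta}}\pi(a|S_t,\boldsymbol{\theta})}{\pi(a|S_t,\boldsymbol{\theta})}\Big] \\
&= \mathbb{E}_\pi\Big[q_\pi(S_t,A_t)\,\frac{\nabla_{\boldsymbol{\theta}}\pi(A_t|S_t,\boldsymbol{\theta})}{\pi(A_t|S_t,\boldsymbol{\theta})}\Big] && \text{(replacing $a$ by the sample $A_t \sim \pi$)} \\
&= \mathbb{E}_\pi\Big[G_t\,\frac{\nabla_{\boldsymbol{\theta}}\pi(A_t|S_t,\boldsymbol{\theta})}{\pi(A_t|S_t,\boldsymbol{\theta})}\Big] && \text{(because $\mathbb{E}_\pi[G_t \mid S_t,A_t] = q_\pi(S_t,A_t)$)}
\end{aligned}
$$

where $G_t$ is the return as usual. The final expression in the brackets is exactly what is needed, a
quantity that can be sampled on each time step whose expectation is proportional to the gradient. Using this
sample to instantiate our generic stochastic gradient ascent algorithm (13.1) yields the update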


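$$
\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha G_t\,\frac{\nabla_{\boldsymbol{\theta}}\pi(A_t|S_t,\boldsymbol{\theta}_t)}{\pi(A_t|S_t,\boldsymbol{\theta}_t)}. \tag{13.6}
$$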


We call this algorithm REINFORCE (after Williams, 1992). Its update has an intuitive appeal. Each
increment is proportional to the product of a return $G_t$ and a vector, the gradient of the probability
of taking the action actually taken divided by the probability of taking that action. The vector is the
direction in parameter space that most increases the probability of repeating the action $A_t$ on future
visits to state $S_t$. The update increases the parameter vector in this direction proportional to the
return, and inversely proportional to the action probability. The former makes sense because it causes
the parameter to move most in the directions that favor actions that yield the highest return. The latter
makes sense because otherwise actions that are selected frequently are at an advantage (the updates
will be more often in their direction) and might win out even if they do not yield the highest return.

Note that REINFORCE uses the complete return from time $t$, which includes all future rewards up
until the end of the episode. In this sense REINFORCE is a Monte Carlo algorithm and is well defined
only for the episodic case with all updates made in retrospect after the episode is completed (like the
Monte Carlo algorithms in Chapter 5). This is shown explicitly in the boxed pseudocode below.
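
As a concrete illustration of updating in retrospect, the sketch below applies REINFORCE to the short corridor of Example 13.1 with a softmax policy over the two action preferences. It is a minimal sketch, not the book's boxed algorithm: all function names are hypothetical, the `max_steps` cap is a safety guard added here, and the factor `gamma ** t` is an assumption about how discounting enters the update (it has no effect in the undiscounted case treated in the text). The gradient of the log-probability is used in place of the ratio in (13.6); the two are equal.

```python
import math
import random

RIGHT, LEFT = 0, 1
TERMINAL = 3


def pi_right(theta):
    """Probability of right under a softmax over preferences theta[0], theta[1]."""
    m = max(theta)
    e_right = math.exp(theta[RIGHT] - m)
    e_left = math.exp(theta[LEFT] - m)
    return e_right / (e_right + e_left)


def grad_ln_pi(theta, action):
    """Gradient of ln pi(action) for one-hot action features."""
    p = pi_right(theta)
    probs = [p, 1.0 - p]
    return [(1.0 if b == action else 0.0) - probs[b] for b in (RIGHT, LEFT)]


def step(state, action):
    """Corridor dynamics: actions are reversed in the second state."""
    move = 1 if action == RIGHT else -1
    if state == 1:
        move = -move
    return max(0, state + move)


def generate_episode(theta, rng, max_steps=100_000):
    """Follow the policy from the start state; all states look identical."""
    state, actions = 0, []
    while state != TERMINAL and len(actions) < max_steps:
        action = RIGHT if rng.random() < pi_right(theta) else LEFT
        actions.append(action)
        state = step(state, action)
    return actions


def reinforce_update(theta, actions, alpha, gamma=1.0):
    """Apply the REINFORCE update for every step of one finished episode."""
    # Returns G_t, accumulated backward with reward -1 on every step.
    returns, G = [], 0.0
    for _ in actions:
        G = -1.0 + gamma * G
        returns.append(G)
    returns.reverse()
    theta = list(theta)
    for t, (action, G_t) in enumerate(zip(actions, returns)):
        g = grad_ln_pi(theta, action)
        theta = [th + alpha * (gamma ** t) * G_t * gi
                 for th, gi in zip(theta, g)]
    return theta
```

A training run would alternate `generate_episode` and `reinforce_update` with a small step size. Because every return in this task is negative, each update lowers the probability of the action taken, and more so for actions taken early in long episodes.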
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Figure 13.1 shows the performance of REINFORCE, averaged over 100 runs, on the gridworld from 
Example 13.1. 

The vector $\frac{\nabla_\theta \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)}$ in the REINFORCE update is the only place the policy
parameterization appears in the algorithm. This vector has been given several names and notations in
the literature; we will refer to it simply as the eligibility vector. The eligibility vector is often written
in the compact form $\nabla_\theta \ln \pi(A_t \mid S_t, \theta_t)$, using the identity $\nabla \ln x = \frac{\nabla x}{x}$. This form is used in all
the boxed pseudocode in this chapter. In earlier examples in this chapter we considered exponential
softmax policies (13.2) with linear action preferences (13.3). For this parameterization, the eligibility
vector is

$$\nabla_\theta \ln \pi(a \mid s, \theta) = \mathbf{x}(s,a) - \sum_b \pi(b \mid s, \theta)\, \mathbf{x}(s,b). \tag{13.7}$$
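
Equation (13.7) is easy to sanity-check numerically. The sketch below assumes NumPy and uses made-up random features and parameters (the names `softmax_policy`, `eligibility`, and `numeric_eligibility` are illustrative, not from the text); it compares the closed form with a central finite-difference estimate of the gradient of the log-probability.

```python
import numpy as np

def softmax_policy(theta, x):
    """pi(.|s, theta) for linear preferences h(s,a,theta) = theta^T x(s,a).

    x holds one row of features per action: x[a] = x(s, a).
    """
    h = x @ theta
    h = h - h.max()              # shift for numerical stability; probabilities unchanged
    e = np.exp(h)
    return e / e.sum()

def eligibility(theta, x, a):
    """Eligibility vector grad_theta ln pi(a|s,theta), following (13.7)."""
    return x[a] - softmax_policy(theta, x) @ x

def numeric_eligibility(theta, x, a, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d[i] = eps
        hi = np.log(softmax_policy(theta + d, x)[a])
        lo = np.log(softmax_policy(theta - d, x)[a])
        g[i] = (hi - lo) / (2 * eps)
    return g

# Hypothetical problem: 4 actions, 3 features per state-action pair.
rng = np.random.default_rng(0)
theta = rng.normal(size=3)
x = rng.normal(size=(4, 3))
analytic = eligibility(theta, x, a=2)
numeric = numeric_eligibility(theta, x, a=2)
```

The two vectors should agree to within finite-difference error, which is one way to check a solution to Exercise 13.2 before attempting the proof.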

As a stochastic gradient method, REINFORCE has good theoretical convergence properties. By
construction, the expected update over an episode is in the same direction as the performance gradient.²
This assures an improvement in expected performance for sufficiently small $\alpha$, and convergence to a
local optimum under standard stochastic approximation conditions for decreasing $\alpha$. However, as a
Monte Carlo method REINFORCE may be of high variance and thus produce slow learning.

² Technically, this is only true if each episode’s updates are done off-line, meaning they are accumulated on the side
during the episode and only used to change $\theta$ by their sum at the episode’s end. However, this would probably be a
worse algorithm in practice, and its desirable theoretical properties would probably be shared by the algorithm as given
(although this has not been proved).

[Plot omitted. Horizontal axis: episode.]

Figure 13.1: REINFORCE on the short-corridor gridworld (Example 13.1). With a good step size, the total
reward per episode approaches the optimal value of the start state.

Exercise 13.2 Prove (13.7) using the definitions and elementary calculus. □ 


13.4 REINFORCE with Baseline 


The policy gradient theorem (13.5) can be generalized to include a comparison of the action value to 
an arbitrary baseline b(s): 

$$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a \Big( q_\pi(s,a) - b(s) \Big) \nabla_\theta \pi(a \mid s, \theta). \tag{13.8}$$

The baseline can be any function, even a random variable, as long as it does not vary with $a$; the
equation remains valid because the subtracted quantity is zero:

$$\sum_a b(s) \nabla_\theta \pi(a \mid s, \theta) = b(s) \nabla_\theta \sum_a \pi(a \mid s, \theta) = b(s) \nabla_\theta 1 = 0.$$


The policy gradient theorem with baseline (13.8) can be used to derive an update rule using similar 
steps as in the previous section. The update rule that we end up with is a new version of REINFORCE 
that includes a general baseline: 


$$\theta_{t+1} = \theta_t + \alpha \big( G_t - b(S_t) \big) \frac{\nabla_\theta \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)}. \tag{13.9}$$


Because the baseline could be uniformly zero, this update is a strict generalization of REINFORCE. In 
general, the baseline leaves the expected value of the update unchanged, but it can have a large effect 
on its variance. For example, we saw in Section 2.8 that an analogous baseline can significantly reduce 
the variance (and thus speed the learning) of gradient bandit algorithms. In the bandit algorithms the 
baseline was just a number (the average of the rewards seen so far), but for MDPs the baseline should 
vary with state. In some states all actions have high values and we need a high baseline to differentiate 
the higher valued actions from the less highly valued ones; in other states all actions will have low values 
and a low baseline is appropriate. 
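
The claim that a baseline leaves the expected update unchanged while affecting its variance can be checked exactly on a small case. The sketch below is an illustration under assumptions not taken from the text: a single state, a tabular softmax policy (so the eligibility is a one-hot vector minus the policy), and hypothetical action values that are all large. It enumerates the actions to compute the exact mean and total variance of the sampled update direction.

```python
import numpy as np

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

def update_stats(theta, q, b):
    """Exact mean and total variance of (q[a] - b) * grad ln pi(a), with a ~ pi.

    One-state problem with a tabular softmax policy, for which
    grad ln pi(a) = onehot(a) - pi.
    """
    pi = softmax(theta)
    n = len(theta)
    samples = np.array([(q[a] - b) * (np.eye(n)[a] - pi) for a in range(n)])
    mean = pi @ samples                                   # expectation over actions
    var = pi @ ((samples - mean) ** 2).sum(axis=1)        # trace of the covariance
    return mean, var

theta = np.array([0.2, -0.1, 0.4])
q = np.array([10.0, 11.0, 12.0])        # hypothetical action values, all high
pi = softmax(theta)

mean_0, var_0 = update_stats(theta, q, b=0.0)               # no baseline
mean_b, var_b = update_stats(theta, q, b=float(pi @ q))     # baseline = state value
```

Here `mean_0` and `mean_b` coincide, while `var_b` is much smaller than `var_0`, because subtracting the state value removes the large common offset shared by all three actions.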

One natural choice for the baseline is an estimate of the state value, $\hat v(S_t, \mathbf{w})$, where $\mathbf{w} \in \mathbb{R}^m$ is a
weight vector learned by one of the methods presented in previous chapters. Because REINFORCE is
a Monte Carlo method for learning the policy parameter, $\theta$, it seems natural to also use a Monte Carlo
method to learn the state-value weights, $\mathbf{w}$. A complete pseudocode algorithm for REINFORCE with
baseline is given in the box on the next page using such a learned state-value function as the baseline.
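
A minimal Python sketch of this idea follows. It is not the boxed algorithm itself: it assumes the undiscounted case ($\gamma = 1$), a linear state-value estimate $\hat v(s,\mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$, and a softmax policy with linear preferences, and the function name and episode format are hypothetical choices made for illustration.

```python
import numpy as np

def softmax_policy(theta, x_sa):
    h = x_sa @ theta
    e = np.exp(h - h.max())
    return e / e.sum()

def reinforce_with_baseline(theta, w, episode, alpha_theta, alpha_w):
    """Apply update (13.9) over one completed episode (gamma = 1 assumed).

    episode: list of (x_s, x_sa, a, r), one entry per time step, where
      x_s  : state features, used as v_hat(s, w) = w @ x_s
      x_sa : policy features, one row per action
      a    : index of the action taken
      r    : reward received after taking a
    """
    # Monte Carlo returns G_t, computed backward from the end of the episode.
    G, returns = 0.0, []
    for (_, _, _, r) in reversed(episode):
        G += r
        returns.append(G)
    returns.reverse()

    for (x_s, x_sa, a, _), G_t in zip(episode, returns):
        delta = G_t - w @ x_s                        # return minus baseline
        w = w + alpha_w * delta * x_s                # Monte Carlo value update
        pi = softmax_policy(theta, x_sa)
        grad_ln_pi = x_sa[a] - pi @ x_sa             # eligibility vector (13.7)
        theta = theta + alpha_theta * delta * grad_ln_pi
    return theta, w

# One-step episode: a single state, two actions, action 1 taken, reward 2.
x_s, x_sa = np.array([1.0]), np.eye(2)
theta1, w1 = reinforce_with_baseline(
    np.zeros(2), np.zeros(1), [(x_s, x_sa, 1, 2.0)], alpha_theta=0.1, alpha_w=0.5)
```

With a zero initial baseline the return exceeds the estimate, so the weight moves toward the return and the preference for the action taken rises. Passing `alpha_w=0` with `w` at zero reduces the sketch to plain REINFORCE.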

This algorithm has two step sizes, denoted $\alpha^\theta$ and $\alpha^{\mathbf{w}}$ (where $\alpha^\theta$ is the $\alpha$ in (13.9)). The step size
for values (here $\alpha^{\mathbf{w}}$) is relatively easy to set; in the linear case we have rules of thumb for setting it, such as
$\alpha^{\mathbf{w}} = 0.1 / \mathbb{E}\big[\|\nabla_{\mathbf{w}} \hat v(S_t, \mathbf{w})\|^2_\mu\big]$. It is much less clear how to set the step size $\alpha^\theta$ for the policy parameters.
It depends on the range of variation of the rewards and on the policy parameterization.

Figure 13.2 compares the behavior of REINFORCE with and without a baseline on the short-corridor
gridworld (Example 13.1). Here the approximate state-value function used in the baseline is $\hat v(s, \mathbf{w}) = w$.
That is, $\mathbf{w}$ is a single component, $w$. The step size used for the baseline was $\beta = 0.1$.


[Plot omitted. Vertical axis: total reward per episode, $G_0$; the reference level $v_*(s_0)$ is marked.]


Figure 13.2: Adding a baseline to REINFORCE can make it learn much faster, as illustrated here on the 
short-corridor gridworld (Example 13.1). The step size used here for plain REINFORCE is that at which it 
performs best (to the nearest power of two; see Figure 13.1). Each line is an average over 100 independent runs. 


13.5 Actor-Critic Methods

Although the REINFORCE-with-baseline method learns both a policy and a state-value function, we
do not consider it to be an actor-critic method because its state-value function is used only as a
baseline, not as a critic. That is, it is not used for bootstrapping (updating the value estimate for
a state from the estimated values of subsequent states), but only as a baseline for the state whose
estimate is being updated. This is a useful distinction, for only through bootstrapping do we introduce
bias and an asymptotic dependence on the quality of the function approximation. As we have seen,
the bias introduced through bootstrapping and reliance on the state representation is often beneficial
because it reduces variance and accelerates learning. REINFORCE with baseline is unbiased and
will converge asymptotically to a local optimum, but like all Monte Carlo methods it tends to learn
slowly (produce estimates of high variance) and to be inconvenient to implement online or for continuing
problems. As we have seen earlier in this book, with temporal-difference methods we can eliminate these
inconveniences, and through multi-step methods we can flexibly choose the degree of bootstrapping. In
order to gain these advantages in the case of policy gradient methods we use actor-critic methods with
a bootstrapping critic.

First consider one-step actor-critic methods, the analog of the TD methods introduced in Chapter 6
such as TD(0), Sarsa(0), and Q-learning. The main appeal of one-step methods is that they are fully
online and incremental, yet avoid the complexities of eligibility traces. They are a special case of the
eligibility trace methods, and not as general, but easier to understand. One-step actor-critic methods
replace the full return of REINFORCE (13.9) with the one-step return (and use a learned state-value
function as the baseline) as follows:

$$\begin{aligned}
\theta_{t+1} &= \theta_t + \alpha \big( G_{t:t+1} - \hat v(S_t, \mathbf{w}) \big) \frac{\nabla_\theta \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)} && (13.10) \\
&= \theta_t + \alpha \big( R_{t+1} + \gamma \hat v(S_{t+1}, \mathbf{w}) - \hat v(S_t, \mathbf{w}) \big) \frac{\nabla_\theta \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)} && (13.11) \\
&= \theta_t + \alpha \delta_t \frac{\nabla_\theta \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)}. && (13.12)
\end{aligned}$$


The natural state-value-function learning method to pair with this is semi-gradient TD(0). Pseudocode 
for the complete algorithm is given in the box below. Note that it is now a fully online, incremental 
algorithm, with states, actions, and rewards processed as they occur and then never revisited. 


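One-step Actor-Critic (episodic)

Input: a differentiable policy parameterization $\pi(a \mid s, \theta)$
Input: a differentiable state-value parameterization $\hat v(s, \mathbf{w})$
Parameters: step sizes $\alpha^\theta > 0$, $\alpha^{\mathbf{w}} > 0$

Initialize policy parameter $\theta \in \mathbb{R}^{d'}$ and state-value weights $\mathbf{w} \in \mathbb{R}^{d}$
Repeat forever:
    Initialize $S$ (first state of episode)
    $I \leftarrow 1$
    While $S$ is not terminal:
        $A \sim \pi(\cdot \mid S, \theta)$
        Take action $A$, observe $S'$, $R$
        $\delta \leftarrow R + \gamma \hat v(S', \mathbf{w}) - \hat v(S, \mathbf{w})$    (if $S'$ is terminal, then $\hat v(S', \mathbf{w}) = 0$)
        $\mathbf{w} \leftarrow \mathbf{w} + \alpha^{\mathbf{w}} I \delta \nabla_{\mathbf{w}} \hat v(S, \mathbf{w})$
        $\theta \leftarrow \theta + \alpha^\theta I \delta \nabla_\theta \ln \pi(A \mid S, \theta)$
        $I \leftarrow \gamma I$
        $S \leftarrow S'$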
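
The inner loop of the box can be sketched as a single Python function that processes one transition. This is an illustration under stated assumptions rather than a definitive implementation: the state-value function is taken to be linear, $\hat v(s,\mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$, the policy is a softmax over linear preferences, and the function name and argument layout are hypothetical.

```python
import numpy as np

def softmax_policy(theta, x_sa):
    h = x_sa @ theta
    e = np.exp(h - h.max())
    return e / e.sum()

def actor_critic_step(theta, w, I, x_s, x_next, x_sa, a, r, terminal,
                      alpha_theta, alpha_w, gamma):
    """One transition of one-step actor-critic with linear v_hat and softmax pi.

    x_s, x_next : state features of S and S'
    x_sa        : policy features for S, one row per action
    a, r        : action taken in S and the reward observed
    I           : accumulated discount gamma^t carried between steps
    """
    v_next = 0.0 if terminal else w @ x_next        # value of terminal state is 0
    delta = r + gamma * v_next - w @ x_s            # one-step TD error
    grad_ln_pi = x_sa[a] - softmax_policy(theta, x_sa) @ x_sa
    w = w + alpha_w * I * delta * x_s               # critic: semi-gradient TD(0)
    theta = theta + alpha_theta * I * delta * grad_ln_pi   # actor
    return theta, w, gamma * I

# A single hand-checkable transition between two one-hot states.
theta, w, I = actor_critic_step(
    theta=np.zeros(2), w=np.array([1.0, 3.0]), I=1.0,
    x_s=np.array([1.0, 0.0]), x_next=np.array([0.0, 1.0]),
    x_sa=np.eye(2), a=0, r=-1.0, terminal=False,
    alpha_theta=0.5, alpha_w=0.1, gamma=0.9)
```

In this example the TD error is $-1 + 0.9 \cdot 3 - 1 = 0.7$, so the value of the first state and the preference for action 0 both increase. Because each call uses only the current transition, nothing needs to be stored once the step has been processed.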


The generalizations to the forward view of multi-step methods and then to a $\lambda$-return algorithm are
straightforward. The one-step return in (13.10) is merely replaced by $G_{t:t+k}$ and $G_t^\lambda$ respectively. The
backward views are also straightforward, using separate eligibility traces for the actor and critic, each
after the patterns in Chapter 12. Pseudocode for the complete algorithm is given in the box below.

Actor-Critic with Eligibility Traces (episodic)


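Input: a differentiable policy parameterization $\pi(a \mid s, \theta)$
Input: a differentiable state-value parameterization $\hat v(s, \mathbf{w})$
Parameters: trace-decay rates $\lambda^\theta \in [0, 1]$, $\lambda^{\mathbf{w}} \in [0, 1]$; step sizes $\alpha^\theta > 0$, $\alpha^{\mathbf{w}} > 0$

Initialize policy parameter $\theta \in \mathbb{R}^{d'}$ and state-value weights $\mathbf{w} \in \mathbb{R}^{d}$
Repeat forever (for each episode):
    Initialize $S$ (first state of episode)
    $\mathbf{z}^\theta \leftarrow \mathbf{0}$ ($d'$-component eligibility trace vector)
    $\mathbf{z}^{\mathbf{w}} \leftarrow \mathbf{0}$ ($d$-component eligibility trace vector)
    $I \leftarrow 1$
    While $S$ is not terminal (for each time step):
        $A \sim \pi(\cdot \mid S, \theta)$
        Take action $A$, observe $S'$, $R$
        $\delta \leftarrow R + \gamma \hat v(S', \mathbf{w}) - \hat v(S, \mathbf{w})$    (if $S'$ is terminal, then $\hat v(S', \mathbf{w}) = 0$)
        $\mathbf{z}^{\mathbf{w}} \leftarrow \gamma \lambda^{\mathbf{w}} \mathbf{z}^{\mathbf{w}} + I \nabla_{\mathbf{w}} \hat v(S, \mathbf{w})$
        $\mathbf{z}^\theta \leftarrow \gamma \lambda^\theta \mathbf{z}^\theta + I \nabla_\theta \ln \pi(A \mid S, \theta)$
        $\mathbf{w} \leftarrow \mathbf{w} + \alpha^{\mathbf{w}} \delta \mathbf{z}^{\mathbf{w}}$
        $\theta \leftarrow \theta + \alpha^\theta \delta \mathbf{z}^\theta$
        $I \leftarrow \gamma I$
        $S \leftarrow S'$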
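
As a quick consistency check (a sketch, not part of the boxed algorithm), setting both trace-decay rates to zero should recover the one-step method, in line with the earlier remark that one-step methods are a special case of the eligibility trace methods. With $\lambda^\theta = \lambda^{\mathbf{w}} = 0$ the traces lose their memory term:

```latex
\begin{aligned}
\mathbf{z}^{\mathbf{w}} &\leftarrow \gamma \cdot 0 \cdot \mathbf{z}^{\mathbf{w}}
      + I \nabla_{\mathbf{w}} \hat v(S,\mathbf{w})
      = I \nabla_{\mathbf{w}} \hat v(S,\mathbf{w}), \\
\mathbf{z}^{\theta} &\leftarrow \gamma \cdot 0 \cdot \mathbf{z}^{\theta}
      + I \nabla_\theta \ln \pi(A \mid S,\theta)
      = I \nabla_\theta \ln \pi(A \mid S,\theta), \\
\mathbf{w} &\leftarrow \mathbf{w} + \alpha^{\mathbf{w}} \delta\, \mathbf{z}^{\mathbf{w}}
      = \mathbf{w} + \alpha^{\mathbf{w}} I \delta\, \nabla_{\mathbf{w}} \hat v(S,\mathbf{w}), \\
\theta &\leftarrow \theta + \alpha^{\theta} \delta\, \mathbf{z}^{\theta}
      = \theta + \alpha^{\theta} I \delta\, \nabla_\theta \ln \pi(A \mid S,\theta).
\end{aligned}
```

These are the parameter updates of the One-step Actor-Critic (episodic) box.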


13.6 Policy Gradient for Continuing Problems 

As discussed in Section 10.3, for continuing problems without episode boundaries we need to define 
performance in terms of the average rate of reward per time step: 

$$\begin{aligned}
J(\theta) = r(\pi) &= \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\big[ R_t \mid A_{0:t-1} \sim \pi \big] && (13.13) \\
&= \lim_{t \to \infty} \mathbb{E}\big[ R_t \mid A_{0:t-1} \sim \pi \big] \\
&= \sum_s \mu(s) \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r,
\end{aligned}$$

where $\mu$ is the steady-state distribution under $\pi$, $\mu(s) = \lim_{t \to \infty} \Pr\{S_t = s \mid A_{0:t} \sim \pi\}$, which is assumed
to exist and to be independent of $S_0$ (an ergodicity assumption). Remember that this is the special
distribution under which, if you select actions according to $\pi$, you remain in the same distribution:

$$\sum_s \mu(s) \sum_a \pi(a \mid s, \theta)\, p(s' \mid s, a) = \mu(s'). \tag{13.14}$$
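
For a small problem, both the steady-state distribution and the average reward can be computed directly. The sketch below uses a hypothetical two-state, two-action continuing MDP with made-up numbers. As a simplifying assumption it works with expected rewards `r[s, a]` rather than the full joint $p(s', r \mid s, a)$, and it finds $\mu$ by solving the stationarity condition (13.14) together with the normalization constraint.

```python
import numpy as np

# Hypothetical 2-state, 2-action continuing MDP (all numbers are illustrative).
# p[s, a, s2] = probability of moving to s2; r[s, a] = expected reward.
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])
pi = np.array([[0.5, 0.5],     # pi[s, a] = probability of action a in state s
               [0.7, 0.3]])

# State-to-state transition matrix under pi: P_pi[s, s2] = sum_a pi(a|s) p(s2|s,a).
P_pi = np.einsum('sa,sat->st', pi, p)

# Solve mu P_pi = mu together with sum(mu) = 1 as one least-squares system.
n = P_pi.shape[0]
A = np.vstack([P_pi.T - np.eye(n), np.ones(n)])
b = np.concatenate([np.zeros(n), [1.0]])
mu, *_ = np.linalg.lstsq(A, b, rcond=None)

# Average reward r(pi) = sum_s mu(s) sum_a pi(a|s) r(s,a).
avg_reward = mu @ (pi * r).sum(axis=1)
```

For these numbers the stationary distribution is $(7/16, 9/16)$ and the average reward is 1.00625. Applying `P_pi` to `mu` returns `mu` unchanged, which is exactly the statement of (13.14).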

We also define values, $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$ and $q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$, with respect to the
differential return:

$$G_t = R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + \cdots. \tag{13.15}$$

With these alternate definitions, the policy gradient theorem as given for the episodic case (13.5)
remains true for the continuing case. A proof is given in the box below. The forward and
backward view equations also remain the same. Complete pseudocode for the actor-critic algorithm in
the continuing case (backward view) is given in the box that follows the proof.

Proof of the Policy Gradient Theorem (continuing case)


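The proof of the policy gradient theorem for the continuing case begins similarly to the episodic
case. Again we leave it implicit in all cases that $\pi$ is a function of $\theta$ and that the gradients
are with respect to $\theta$. Recall that in the continuing case $J(\theta) = r(\pi)$ (13.13) and that $v_\pi$ and
$q_\pi$ denote values with respect to the differential return (13.15). Below we write $r(\theta)$ for $r(\pi)$ to
emphasize the dependence on $\theta$. The gradient of the state-value function can be written, for any
$s \in \mathcal{S}$, as

$$\begin{aligned}
\nabla v_\pi(s) &= \nabla \Big[ \sum_a \pi(a \mid s)\, q_\pi(s,a) \Big], \quad \text{for all } s \in \mathcal{S} && \text{(Exercise 3.15)} \\
&= \sum_a \Big[ \nabla \pi(a \mid s)\, q_\pi(s,a) + \pi(a \mid s) \nabla q_\pi(s,a) \Big] && \text{(product rule of calculus)} \\
&= \sum_a \Big[ \nabla \pi(a \mid s)\, q_\pi(s,a) + \pi(a \mid s) \nabla \sum_{s', r} p(s', r \mid s, a) \big( r - r(\theta) + v_\pi(s') \big) \Big] \\
&= \sum_a \Big[ \nabla \pi(a \mid s)\, q_\pi(s,a) + \pi(a \mid s) \Big[ -\nabla r(\theta) + \sum_{s'} p(s' \mid s, a) \nabla v_\pi(s') \Big] \Big].
\end{aligned}$$

After re-arranging terms, we obtain

$$\nabla r(\theta) = \sum_a \Big[ \nabla \pi(a \mid s)\, q_\pi(s,a) + \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \nabla v_\pi(s') \Big] - \nabla v_\pi(s).$$

Notice that the left-hand side can be written $\nabla J(\theta)$ and that it does not depend on $s$. Thus the
right-hand side does not depend on $s$ either, and we can safely sum it over all $s \in \mathcal{S}$, weighted
by $\mu(s)$, without changing it (because $\sum_s \mu(s) = 1$). Thus

$$\begin{aligned}
\nabla J(\theta) &= \sum_s \mu(s) \Big( \sum_a \Big[ \nabla \pi(a \mid s)\, q_\pi(s,a) + \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \nabla v_\pi(s') \Big] - \nabla v_\pi(s) \Big) \\
&= \sum_s \mu(s) \sum_a \nabla \pi(a \mid s)\, q_\pi(s,a) \\
&\qquad + \sum_s \mu(s) \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \nabla v_\pi(s') - \sum_s \mu(s) \nabla v_\pi(s) \\
&= \sum_s \mu(s) \sum_a \nabla \pi(a \mid s)\, q_\pi(s,a) \\
&\qquad + \sum_{s'} \underbrace{\sum_s \mu(s) \sum_a \pi(a \mid s)\, p(s' \mid s, a)}_{\mu(s') \text{ by (13.14)}} \nabla v_\pi(s') - \sum_s \mu(s) \nabla v_\pi(s) \\
&= \sum_s \mu(s) \sum_a \nabla \pi(a \mid s)\, q_\pi(s,a) + \sum_{s'} \mu(s') \nabla v_\pi(s') - \sum_s \mu(s) \nabla v_\pi(s) \\
&= \sum_s \mu(s) \sum_a \nabla \pi(a \mid s)\, q_\pi(s,a).
\end{aligned}$$

Q.E.D.

Actor-Critic with Eligibility Traces (continuing)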


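Input: a differentiable policy parameterization $\pi(a \mid s, \theta)$
Input: a differentiable state-value parameterization $\hat v(s, \mathbf{w})$
Parameters: trace-decay rates $\lambda^\theta \in [0, 1]$, $\lambda^{\mathbf{w}} \in [0, 1]$; step sizes $\alpha^\theta > 0$, $\alpha^{\mathbf{w}} > 0$, $\eta > 0$

$\mathbf{z}^\theta \leftarrow \mathbf{0}$ ($d'$-component eligibility trace vector)
$\mathbf{z}^{\mathbf{w}} \leftarrow \mathbf{0}$ ($d$-component eligibility trace vector)
Initialize $\bar R \in \mathbb{R}$ (e.g., to 0)
Initialize policy parameter $\theta \in \mathbb{R}^{d'}$ and state-value weights $\mathbf{w} \in \mathbb{R}^{d}$ (e.g., to $\mathbf{0}$)
Initialize $S \in \mathcal{S}$ (e.g., to $s_0$)
Repeat forever:
    $A \sim \pi(\cdot \mid S, \theta)$
    Take action $A$, observe $S'$, $R$
    $\delta \leftarrow R - \bar R + \hat v(S', \mathbf{w}) - \hat v(S, \mathbf{w})$    (if $S'$ is terminal, then $\hat v(S', \mathbf{w}) = 0$)
    $\bar R \leftarrow \bar R + \eta \delta$
    $\mathbf{z}^{\mathbf{w}} \leftarrow \lambda^{\mathbf{w}} \mathbf{z}^{\mathbf{w}} + \nabla_{\mathbf{w}} \hat v(S, \mathbf{w})$
    $\mathbf{z}^\theta \leftarrow \lambda^\theta \mathbf{z}^\theta + \nabla_\theta \ln \pi(A \mid S, \theta)$
    $\mathbf{w} \leftarrow \mathbf{w} + \alpha^{\mathbf{w}} \delta \mathbf{z}^{\mathbf{w}}$
    $\theta \leftarrow \theta + \alpha^\theta \delta \mathbf{z}^\theta$
    $S \leftarrow S'$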
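
One pass through the loop of the continuing algorithm can be sketched in Python as follows. The sketch assumes a linear state-value function and a softmax policy with linear preferences; the function name `continuing_ac_step` and the tuple used to carry the learner's state are hypothetical choices for illustration.

```python
import numpy as np

def softmax_policy(theta, x_sa):
    h = x_sa @ theta
    e = np.exp(h - h.max())
    return e / e.sum()

def continuing_ac_step(params, x_s, x_next, x_sa, a, r, *,
                       alpha_theta, alpha_w, eta, lam_theta, lam_w):
    """One time step of actor-critic with eligibility traces, average-reward setting.

    params = (theta, w, z_theta, z_w, r_bar), where r_bar estimates r(pi).
    The value function is assumed linear: v_hat(s, w) = w @ x(s).
    """
    theta, w, z_theta, z_w, r_bar = params
    delta = r - r_bar + w @ x_next - w @ x_s       # differential TD error (no discount)
    r_bar = r_bar + eta * delta                    # update average-reward estimate
    z_w = lam_w * z_w + x_s                        # critic trace
    grad_ln_pi = x_sa[a] - softmax_policy(theta, x_sa) @ x_sa
    z_theta = lam_theta * z_theta + grad_ln_pi     # actor trace
    w = w + alpha_w * delta * z_w
    theta = theta + alpha_theta * delta * z_theta
    return theta, w, z_theta, z_w, r_bar

# One hand-checkable step starting from all-zero parameters and traces.
start = (np.zeros(2), np.zeros(2), np.zeros(2), np.zeros(2), 0.0)
theta, w, z_theta, z_w, r_bar = continuing_ac_step(
    start, x_s=np.array([1.0, 0.0]), x_next=np.array([0.0, 1.0]),
    x_sa=np.eye(2), a=1, r=2.0,
    alpha_theta=0.5, alpha_w=0.1, eta=0.1, lam_theta=0.9, lam_w=0.9)
```

Compared with the episodic version there is no discount factor and no accumulated factor $I$; the reward is instead measured relative to the running estimate of the average reward, which is itself nudged by the TD error.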


13.7 Policy Parameterization for Continuous Actions 


Policy-based methods offer practical ways of dealing with large action spaces, even continuous spaces
with an infinite number of actions. Instead of computing learned probabilities for each of the many
actions, we learn statistics of the probability distribution. For example, the action set might be
the real numbers, with actions chosen from a normal (Gaussian) distribution. 

The probability density function for the normal distribution is conventionally written

$$p(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right), \tag{13.16}$$

where $\mu$ and $\sigma$ here are the mean and standard deviation of the normal distribution, and of course $\pi$
here is just the number $\pi \approx 3.14159$. The probability density functions for several different means and
standard deviations are shown in Figure 13.3. The value $p(x)$ is the density of the probability at $x$, not
the probability. It can be greater than 1; it is the total area under $p(x)$ that must equal 1. In general,
one can take the integral of $p(x)$ over any range of $x$ values to get the probability of $x$ falling within
that range.

To produce a policy parameterization, the policy can be defined as the normal probability density
over a real-valued scalar action, with mean and standard deviation given by parametric function
approximators that depend on the state. That is,

$$\pi(a \mid s, \theta) = \frac{1}{\sigma(s, \theta) \sqrt{2\pi}} \exp\!\left( -\frac{(a - \mu(s, \theta))^2}{2 \sigma(s, \theta)^2} \right), \tag{13.17}$$



where $\mu : \mathcal{S} \times \mathbb{R}^{d'} \to \mathbb{R}$ and $\sigma : \mathcal{S} \times \mathbb{R}^{d'} \to \mathbb{R}^+$ are two parameterized function approximators. To
complete the example we need only give a form for these approximators. For this we divide the policy’s
parameter vector into two parts, $\theta = [\theta_\mu, \theta_\sigma]^\top$, one part to be used for the approximation of the mean
and one part for the approximation of the standard deviation. The mean can be approximated as a
linear function. The standard deviation must always be positive and is better approximated as the
exponential of a linear function. Thus

$$\mu(s, \theta) = \theta_\mu^\top \mathbf{x}(s) \quad \text{and} \quad \sigma(s, \theta) = \exp\!\big( \theta_\sigma^\top \mathbf{x}(s) \big), \tag{13.18}$$

where $\mathbf{x}(s)$ is a state feature vector perhaps constructed by one of the methods described in Chapter 9.
With these definitions, all the algorithms described in the rest of this chapter can be applied to learn
to select real-valued actions.

Figure 13.3: The probability density function of the normal distribution for different means and variances.
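
To use the Gaussian parameterization (13.17) and (13.18) in the algorithms of this chapter, one needs the eligibility vector $\nabla_\theta \ln \pi(a \mid s, \theta)$ for each part of $\theta$. The sketch below is an illustration assuming NumPy, with made-up features and parameters. The closed-form gradients in `eligibility` follow from differentiating the log of (13.17) and are checked against central finite differences rather than taken on trust.

```python
import numpy as np

def gaussian_policy(theta_mu, theta_sigma, x):
    """Mean and standard deviation from (13.18)."""
    mu = theta_mu @ x
    sigma = np.exp(theta_sigma @ x)     # exponential keeps sigma positive
    return mu, sigma

def log_density(a, theta_mu, theta_sigma, x):
    """ln pi(a|s,theta) for the normal density (13.17)."""
    mu, sigma = gaussian_policy(theta_mu, theta_sigma, x)
    return -np.log(sigma) - 0.5 * np.log(2 * np.pi) - (a - mu) ** 2 / (2 * sigma ** 2)

def eligibility(a, theta_mu, theta_sigma, x):
    """Gradients of ln pi with respect to theta_mu and theta_sigma."""
    mu, sigma = gaussian_policy(theta_mu, theta_sigma, x)
    g_mu = (a - mu) / sigma ** 2 * x
    g_sigma = ((a - mu) ** 2 / sigma ** 2 - 1.0) * x
    return g_mu, g_sigma

def finite_diff(f, v, eps=1e-6):
    """Central finite-difference gradient of scalar function f at vector v."""
    return np.array([(f(v + eps * e) - f(v - eps * e)) / (2 * eps)
                     for e in np.eye(v.size)])

# Hypothetical state features and parameters.
rng = np.random.default_rng(0)
x = np.array([1.0, 0.5])
theta_mu = np.array([0.3, -0.2])
theta_sigma = np.array([-0.1, 0.4])

mu, sigma = gaussian_policy(theta_mu, theta_sigma, x)
a = rng.normal(mu, sigma)               # sample a real-valued action

g_mu, g_sigma = eligibility(a, theta_mu, theta_sigma, x)
num_mu = finite_diff(lambda t: log_density(a, t, theta_sigma, x), theta_mu)
num_sigma = finite_diff(lambda t: log_density(a, theta_mu, t, x), theta_sigma)
```

The mean's gradient pushes $\mu$ toward the sampled action, scaled by the inverse variance; the standard deviation's gradient is positive when the action lies more than one standard deviation from the mean and negative otherwise.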

Exercise 13.3 A Bernoulli-logistic unit is a stochastic neuron-like unit used in some artificial neural
networks (Section 9.6). Its input at time $t$ is a feature vector $\mathbf{x}(S_t)$; its output, $A_t$, is a random variable
having two values, 0 and 1, with $\Pr\{A_t = 1\} = P_t$ and $\Pr\{A_t = 0\} = 1 - P_t$ (the Bernoulli distribution).
Let $h(s, 0, \theta)$ and $h(s, 1, \theta)$ be the preferences in state $s$ for the unit’s two actions given policy parameter
$\theta$. Assume that the difference between the preferences is given by a weighted sum of the unit’s input
vector, that is, assume that $h(s, 1, \theta) - h(s, 0, \theta) = \theta^\top \mathbf{x}(s)$, where $\theta$ is the unit’s weight vector.

(a) Show that if the exponential softmax distribution (13.2) is used to convert preferences to policies,
then $P_t = \pi(1 \mid S_t, \theta_t) = 1/(1 + \exp(-\theta_t^\top \mathbf{x}(S_t)))$ (the logistic function).

(b) What is the Monte-Carlo REINFORCE update of $\theta_t$ to $\theta_{t+1}$ upon receipt of return $G_t$?

(c) Express the eligibility $\nabla_\theta \ln \pi(a \mid s, \theta)$ for a Bernoulli-logistic unit, in terms of $a$, $\mathbf{x}(s)$, and $\pi(a \mid s, \theta)$
by calculating the gradient.

Hint: separately for each action compute the derivative of the logarithm first with respect to $P_t =
\pi(a \mid s, \theta_t)$, combine the two results into one expression that depends on $a$ and $P_t$, and then use the
chain rule, noting that the derivative of the logistic function $f(x)$ is $f(x)(1 - f(x))$. □

13.8 Summary

Prior to this chapter, this book focused on action-value methods, meaning methods that learn action
values and then use them to determine action selections. In this chapter, on the other hand, we
considered methods that learn a parameterized policy that enables actions to be taken without consulting
action-value estimates, though action-value estimates may still be learned and used to update the
policy parameter. In particular, we have considered policy-gradient methods, meaning methods that
update the policy parameter on each step in the direction of an estimate of the gradient of performance
with respect to the policy parameter.

Methods that learn and store a policy parameter have many advantages. They can learn specific
probabilities for taking the actions. They can learn appropriate levels of exploration and approach
deterministic policies asymptotically. They can naturally handle continuous action spaces. All these
things are easy for policy-based methods but awkward or impossible for $\varepsilon$-greedy methods and for
action-value methods in general. In addition, on some problems the policy is just simpler to represent
parametrically than the value function; these problems are more suited to parameterized policy methods.

Parameterized policy methods also have an important theoretical advantage over action-value
methods in the form of the policy gradient theorem, which gives an exact formula for how performance is
affected by the policy parameter that does not involve derivatives of the state distribution. This theorem
provides a theoretical foundation for all policy gradient methods.

The REINFORCE method follows directly from the policy gradient theorem. Adding a state-value
function as a baseline reduces REINFORCE’s variance without introducing bias. Using the state-value
function for bootstrapping introduces bias but is often desirable for the same reason that bootstrapping
TD methods are often superior to Monte Carlo methods (substantially reduced variance). The
state-value function assigns credit to (criticizes) the policy’s action selections, and accordingly the former is
termed the critic and the latter the actor, and these overall methods are sometimes termed actor-critic
methods.

Overall, policy-gradient methods provide a significantly different set of strengths and weaknesses
than action-value methods. Today they are less well understood in some respects, but a subject of
excitement and ongoing research.


Bibliographical and Historical Remarks 

Methods that we now see as related to policy gradients were actually some of the earliest to be studied
in reinforcement learning (Witten, 1977; Barto, Sutton, and Anderson, 1983; Sutton, 1984; Williams,
1987, 1992) and in predecessor fields (Phansalkar and Thathachar, 1995). They were largely supplanted
in the 1990s by the action-value methods that are the focus of the other chapters of this book. In
recent years, however, attention has returned to actor-critic methods and to policy-gradient methods
in general. Among the further developments beyond what we cover here are natural-gradient methods
(Amari, 1998; Kakade, 2002; Peters, Vijayakumar and Schaal, 2005; Peters and Schaal, 2008; Park,
Kim and Kang, 2005; Bhatnagar, Sutton, Ghavamzadeh and Lee, 2009; see Grondman, Busoniu, Lopes
and Babuska, 2012), and deterministic policy gradient (Silver et al., 2014). Major applications include
acrobatic helicopter autopilots and AlphaGo (see Section 16.6).

Our presentation in this chapter is based primarily on that by Sutton, McAllester, Singh, and
Mansour (2000, see also Sutton, Singh, and McAllester, 2000), who introduced the term “policy gradient
methods.” A useful overview is provided by Bhatnagar et al. (2003). One of the earliest related works
is by Aleksandrov, Sysoyev, and Shemeneva (1968).

13.1 Example 13.1 and the results with it in this chapter were developed with Eric Graves.

13.2 The policy gradient theorem here and on page 276 was first obtained by Marbach and Tsitsiklis
(1998, 2001) and then independently by Sutton et al. (2000). A similar expression was obtained 
by Cao and Chen (1997). Other early results are due to Konda and Tsitsiklis (2000, 2003) and 
Baxter and Bartlett (2000). Some additional results are developed by Sutton 

13.3 REINFORCE is due to Williams (1987, 1992). 

Phansalkar and Thathachar (1995) proved both local and global convergence theorems for 
modified versions of REINFORCE algorithms. 

13.4 The baseline was introduced in Williams’s (1987, 1992) original work. Greensmith, Bartlett, 
and Baxter (2004) analyzed an arguably better baseline (see Dick, 2015). 

13.5-6 Actor-critic methods were among the earliest to be investigated in reinforcement learning
(Witten, 1977; Barto, Sutton, and Anderson, 1983; Sutton, 1984). The algorithms presented here
and in Section 13.6 are based on the work of Degris, White, and Sutton (2012), who also
introduced the study of off-policy policy-gradient methods.

13.7 The first to show how continuous actions could be handled this way appears to have been 
Williams (1987, 1992). 



Part III: Looking Deeper 


In this last part of the book we look beyond the standard reinforcement learning ideas presented in
the first two parts of the book to briefly survey their relationships with psychology and neuroscience, a
sampling of reinforcement learning applications, and some of the active frontiers for future reinforcement
learning research.

Chapter 14


Psychology 


In previous chapters we developed ideas for algorithms based on computational considerations alone. In 
this chapter we look at some of these algorithms from another perspective: the perspective of psychology 
and its study of how animals learn. The goals of this chapter are, first, to discuss ways that reinforcement 
learning ideas and algorithms correspond to what psychologists have discovered about animal learning, 
and second, to explain the influence reinforcement learning is having on the study of animal learning. 
The clear formalism provided by reinforcement learning that systemizes tasks, returns, and algorithms 
is proving to be enormously useful in making sense of experimental data, in suggesting new kinds of 
experiments, and in pointing to factors that may be critical to manipulate and to measure. The idea 
of optimizing return over the long term that is at the core of reinforcement learning is contributing to 
our understanding of otherwise puzzling features of animal learning and behavior. 

Some of the correspondences between reinforcement learning and psychological theories are not
surprising because the development of reinforcement learning drew inspiration from psychological learning
theories. However, as developed in this book, reinforcement learning explores idealized situations from
the perspective of an artificial intelligence researcher or engineer, with the goal of solving
computational problems with efficient algorithms, rather than to replicate or explain in detail how animals
learn. As a result, some of the correspondences we describe connect ideas that arose independently
in their respective fields. We believe these points of contact are especially meaningful because they
expose computational principles important to learning, whether it is learning by artificial or by natural
systems.

For the most part, we describe correspondences between reinforcement learning and learning theories
developed to explain how animals like rats, pigeons, and rabbits learn in controlled laboratory
experiments. Thousands of these experiments were conducted throughout the 20th century, and many are
still being conducted today. Although sometimes dismissed as irrelevant to wider issues in psychology, 
these experiments probe subtle properties of animal learning, often motivated by precise theoretical 
questions. As psychology shifted its focus to more cognitive aspects of behavior, that is, to mental 
processes such as thought and reasoning, animal learning experiments came to play less of a role in 
psychology than they once did. But this experimentation led to the discovery of learning principles that 
are elemental and widespread throughout the animal kingdom, principles that should not be neglected 
in designing artificial learning systems. In addition, as we shall see, some aspects of cognitive processing 
connect naturally to the computational perspective provided by reinforcement learning. 

This chapter’s final section includes references relevant to the connections we discuss as well as to
connections we neglect. We hope this chapter encourages readers to probe all of these connections more
deeply. Also included in this final section is a discussion of how the terminology used in reinforcement
learning relates to that of psychology. Many of the terms and phrases used in reinforcement learning are
borrowed from animal learning theories, but the computational/engineering meanings of these terms
and phrases do not always coincide with their meanings in psychology.


14.1 Prediction and Control 

The algorithms we describe in this book fall into two broad categories: algorithms for prediction and 
algorithms for control. These categories arise naturally in solution methods for the reinforcement 
learning problem presented in Chapter 3. In many ways these categories respectively correspond to 
categories of learning extensively studied by psychologists: classical, or Pavlovian, conditioning and 
instrumental, or operant, conditioning. These correspondences are not completely accidental because 
of psychology’s influence on reinforcement learning, but they are nevertheless striking because they 
connect ideas arising from different objectives. 

The prediction algorithms presented in this book estimate quantities that depend on how features of 
an agent’s environment are expected to unfold over the future. We specifically focus on estimating the 
amount of reward an agent can expect to receive over the future while it interacts with its environment. 
In this role, prediction algorithms are policy evaluation algorithms, which are integral components of 
algorithms for improving policies. But prediction algorithms are not limited to predicting future reward; 
they can predict any feature of the environment (see, for example, Modayil, White, and Sutton, 2014). 
The correspondence between prediction algorithms and classical conditioning rests on their common 
property of predicting upcoming stimuli, whether or not those stimuli are rewarding (or punishing). 

The situation in an instrumental, or operant, conditioning experiment is different. Here, the
experimental apparatus is set up so that an animal is given something it likes (a reward) or something it
dislikes (a penalty) depending on what the animal did. The animal learns to increase its tendency to
produce rewarded behavior and to decrease its tendency to produce penalized behavior. The reinforcing
stimulus is said to be contingent on the animal’s behavior, whereas in classical conditioning it is not
(although it is difficult to remove all behavior contingencies in a classical conditioning experiment).
Instrumental conditioning experiments are like those that inspired Thorndike’s Law of Effect that we
briefly discuss in Chapter 1. Control is at the core of this form of learning, which corresponds to the
operation of reinforcement learning’s policy-improvement algorithms.¹

Thinking of classical conditioning in terms of prediction, and instrumental conditioning in terms of
control, is a starting point for connecting our computational view of reinforcement learning to animal
learning, but in reality, the situation is more complicated than this. There is more to classical
conditioning than prediction; it also involves action, and so is a mode of control, sometimes called Pavlovian
control. Further, classical and instrumental conditioning interact in interesting ways, with both sorts
of learning likely being engaged in most experimental situations. Despite these complications,
aligning the classical/instrumental distinction with the prediction/control distinction is a convenient first
approximation in connecting reinforcement learning to animal learning.

In psychology, the term reinforcement is used to describe learning in both classical and instrumental 
conditioning. Originally referring only to the strengthening of a pattern of behavior, it is frequently also 
used for the weakening of a pattern of behavior. A stimulus considered to be the cause of the change in 
behavior is called a reinforcer, whether or not it is contingent on the animal’s previous behavior. At the
end of this chapter we discuss this terminology in more detail and how it relates to terminology used 
in machine learning. 


¹ What control means for us is different from what it typically means in animal learning theories; there the environment
controls the agent instead of the other way around. See our comments on terminology at the end of this chapter.

14.2 Classical Conditioning

While studying the activity of the digestive system, the celebrated Russian physiologist Ivan Pavlov 
found that an animal’s innate responses to certain triggering stimuli can come to be triggered by other 
stimuli that are quite unrelated to the inborn triggers. His experimental subjects were dogs that had 
undergone minor surgery to allow the intensity of their salivary reflex to be accurately measured. In 
one case he describes, the dog did not salivate under most circumstances, but about 5 seconds after 
being presented with food it produced about six drops of saliva over the next several seconds. After 
several repetitions of presenting another stimulus, one not related to food, in this case the sound of a 
metronome, shortly before the introduction of food, the dog salivated in response to the sound of the 
metronome in the same way it did to the food. “The activity of the salivary gland has thus been called 
into play by impulses of sound—a stimulus quite alien to food” (Pavlov, 1927, p. 22). Summarizing the 
significance of this finding, Pavlov wrote: 

It is pretty evident that under natural conditions the normal animal must respond not only 
to stimuli which themselves bring immediate benefit or harm, but also to other physical or 
chemical agencies—waves of sound, light, and the like—which in themselves only signal the 
approach of these stimuli; though it is not the sight and sound of the beast of prey which is 
in itself harmful to the smaller animal, but its teeth and claws. (Pavlov, 1927, p. 14) 

Connecting new stimuli to innate reflexes in this way is now called classical, or Pavlovian,
conditioning. Pavlov (or more exactly, his translators) called inborn responses (e.g., salivation in his
demonstration described above) “unconditioned responses” (URs), their natural triggering stimuli (e.g., food)
“unconditioned stimuli” (USs), and new responses triggered by predictive stimuli (e.g., here also
salivation) “conditioned responses” (CRs). A stimulus that is initially neutral, meaning that it does not
normally elicit strong responses (e.g., the metronome sound), becomes a “conditioned stimulus” (CS)
as the animal learns that it predicts the US and so comes to produce a CR in response to the CS. These
terms are still used in describing classical conditioning experiments (though better translations would
have been “conditional” and “unconditional” instead of conditioned and unconditioned). The US is
called a reinforcer because it reinforces producing a CR in response to the CS.

Figure 14.1 shows the arrangement of stimuli in two types of classical conditioning experiments: in 
delay conditioning, the CS extends throughout the interstimulus interval, or ISI, which is the time 
interval between the CS onset and the US onset (with the CS ending when the US ends in a common 
version shown here). In trace conditioning, the US begins after the CS ends, and the time interval 
between CS offset and US onset is called the trace interval. 
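
The two timing arrangements can be made concrete with a small sketch. The `Trial` class, its field names, and the millisecond timings below are illustrative assumptions, not values from any experiment described here; the sketch only encodes the definitions of the ISI and the trace interval given above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """Stimulus timing within one trial, in milliseconds from trial start."""
    cs_on: int
    cs_off: int
    us_on: int
    us_off: int

    def isi(self) -> int:
        # Interstimulus interval: CS onset to US onset.
        return self.us_on - self.cs_on

    def trace_interval(self) -> int:
        # Gap between CS offset and US onset; 0 when there is no gap.
        return max(0, self.us_on - self.cs_off)

    def kind(self) -> str:
        # A strictly positive gap makes this trace conditioning.
        return "trace" if self.trace_interval() > 0 else "delay"


# Illustrative timings (assumed values).
delay_trial = Trial(cs_on=0, cs_off=600, us_on=500, us_off=600)
trace_trial = Trial(cs_on=0, cs_off=200, us_on=500, us_off=600)

print(delay_trial.kind(), delay_trial.isi(), delay_trial.trace_interval())
print(trace_trial.kind(), trace_trial.isi(), trace_trial.trace_interval())
```

Both trials here share the same ISI; they differ only in whether the CS is still present when the US begins.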

The salivation of Pavlov’s dogs to the sound of a metronome is just one example of classical conditioning, 
which has been intensively studied across many response systems of many species of animals. URs 
are often preparatory in some way, like the salivation of Pavlov’s dog, or protective in some way, like 
an eye blink in response to something irritating to the eye, or freezing in response to seeing a predator. 
Experiencing the CS-US predictive relationship over a series of trials causes the animal to learn that 
the CS predicts the US so that the animal can respond to the CS with a CR that prepares the animal 
for, or protects it from, the predicted US. Some CRs are similar to the UR but begin earlier and differ 
in ways that increase their effectiveness. In one intensively studied type of experiment, for example, 
a tone CS reliably predicts a puff of air (the US) to a rabbit’s eye, triggering a UR consisting of the 
closure of a protective inner eyelid called the nictitating membrane. After one or more trials, the tone 
comes to trigger a CR consisting of membrane closure that begins before the air puff and eventually 
becomes timed so that peak closure occurs just when the air puff is likely to occur. This CR, being 
initiated in anticipation of the air puff and appropriately timed, offers better protection than simply 
initiating closure as a reaction to the irritating US. The ability to act in anticipation of important events 
by learning about predictive relationships among stimuli is so beneficial that it is widely present across 
the animal kingdom. 

Figure 14.1: Arrangement of stimuli in two types of classical conditioning experiments. In delay conditioning, 
the CS extends throughout the interstimulus interval, or ISI, which is the time interval between the CS onset 
and the US onset (often with the CS and US ending at the same time as shown here). In trace conditioning, 
there is a time interval, called the trace interval, between CS offset and US onset. 


14.2.1 Blocking and Higher-order Conditioning 

Many interesting properties of classical conditioning have been observed in experiments. Beyond the 
anticipatory nature of CRs, two widely observed properties figured prominently in the development 
of classical conditioning models: blocking and higher-order conditioning. Blocking occurs when an 
animal fails to learn a CR when a potential CS is presented along with another CS that had been used 
previously to condition the animal to produce that CR. For example, in the first stage of a blocking 
experiment involving rabbit nictitating membrane conditioning, a rabbit is first conditioned with a tone 
CS and an air puff US to produce the CR of closing its nictitating membrane in anticipation of the 
air puff. The experiment’s second stage consists of additional trials in which a second stimulus, say a 
light, is added to the tone to form a compound tone/light CS followed by the same air puff US. In the 
experiment’s third stage, the second stimulus alone (the light) is presented to the rabbit to see if the 
rabbit has learned to respond to it with a CR. It turns out that the rabbit produces very few, or no, 
CRs in response to the light: learning to the light had been blocked by the previous learning to the 
tone. 2 Blocking results like this challenged the idea that conditioning depends only on simple temporal 
contiguity, that is, that a necessary and sufficient condition for conditioning is that a US frequently 
follows a CS closely in time. In the next section we describe the Rescorla-Wagner model (Rescorla and 
Wagner, 1972) that offered an influential explanation for blocking. 

Higher-order conditioning occurs when a previously-conditioned CS acts as a US in conditioning another 
initially neutral stimulus. Pavlov described an experiment in which his assistant first conditioned 
a dog to salivate to the sound of a metronome that predicted a food US, as described above. After this 
stage of conditioning, a number of trials were conducted in which a black square, to which the dog was 
initially indifferent, was placed in the dog’s line of vision followed by the sound of the metronome— 
and this was not followed by food. In just ten trials, the dog began to salivate merely upon seeing 
the black square, despite the fact that the sight of it had never been followed by food. The sound of 
the metronome itself acted as a US in conditioning a salivation CR to the black square CS. This was 
second-order conditioning. If the black square had been used as a US to establish salivation CRs to 
another otherwise neutral CS, it would have been third-order conditioning, and so on. Higher-order 
conditioning is difficult to demonstrate, especially above the second order, in part because a 
higher-order reinforcer loses its reinforcing value due to not being repeatedly followed by the original US 
during higher-order conditioning trials. But under the right conditions, such as intermixing first-order trials 
with higher-order trials or by providing a general energizing stimulus, higher-order conditioning beyond 
the second order can be demonstrated. As we describe below, the TD model of classical conditioning 
uses the bootstrapping idea that is central to our approach to extend the Rescorla-Wagner model’s 
account of blocking to include both the anticipatory nature of CRs and higher-order conditioning. 

2 Comparison with a control group is necessary to show that the previous conditioning to the tone is responsible for 
blocking learning to the light. This is done by trials with the tone/light CS but with no prior conditioning to the tone. 
Learning to the light in this case is unimpaired. Moore and Schmajuk (2008) give a full account of this procedure. 

Higher-order instrumental conditioning occurs as well. In this case, a stimulus that consistently 
predicts primary reinforcement becomes a reinforcer itself, where reinforcement is primary if its rewarding 
or penalizing quality has been built into the animal by evolution. The predicting stimulus becomes 
a secondary reinforcer, or more generally, a higher-order or conditioned reinforcer, the latter being a 
better term when the predicted reinforcing stimulus is itself a secondary, or an even higher-order, 
reinforcer. A conditioned reinforcer delivers conditioned reinforcement: conditioned reward or conditioned 
penalty. Conditioned reinforcement acts like primary reinforcement in increasing an animal’s tendency 
to produce behavior that leads to conditioned reward, and in decreasing an animal’s tendency to produce 
behavior that leads to conditioned penalty. (See our comments at the end of this chapter that explain 
how our terminology sometimes differs, as it does here, from terminology used in psychology.) 

Conditioned reinforcement is a key phenomenon that explains, for instance, why we work for the 
conditioned reinforcer money, whose worth derives solely from what is predicted by having it. In actor- 
critic methods described in Section 13.5 (and discussed in the context of neuroscience in Sections 15.7 
and 15.8), the critic uses a TD method to evaluate the actor’s policy, and its value estimates provide 
conditioned reinforcement to the actor, allowing the actor to improve its policy. This analog of higher- 
order instrumental conditioning helps address the credit-assignment problem mentioned in Section 1.7 
because the critic gives moment-by-moment reinforcement to the actor when the primary reward signal 
is delayed. We discuss this more below in Section 14.4. 

14.2.2 The Rescorla-Wagner Model 

Rescorla and Wagner created their model mainly to account for blocking. The core idea of the Rescorla- 
Wagner model is that an animal only learns when events violate its expectations, in other words, only 
when the animal is surprised (although without necessarily implying any conscious expectation or 
emotion). We first present Rescorla and Wagner’s model using their terminology and notation before 
shifting to the terminology and notation we use to describe the TD model. 

Here is how Rescorla and Wagner described their model. The model adjusts the “associative strength” 
of each component stimulus of a compound CS, which is a number representing how strongly or reliably 
that component is predictive of a US. When a compound CS consisting of several component stimuli is 
presented in a classical conditioning trial, the associative strength of each component stimulus changes 
in a way that depends on an associative strength associated with the entire stimulus compound, called 
the “aggregate associative strength,” and not just on the associative strength of each component itself. 

Rescorla and Wagner considered a compound CS AX, consisting of component stimuli A and X, where 
the animal may have already experienced stimulus A, and stimulus X might be new to the animal. Let 
$V_A$, $V_X$, and $V_{AX}$ respectively denote the associative strengths of stimuli A, X, and the compound AX. 
Suppose that on a trial the compound CS AX is followed by a US, which we label stimulus Y. Then 
the associative strengths of the stimulus components change according to these expressions: 

$$\Delta V_A = \alpha_A \beta_Y (R_Y - V_{AX})$$

$$\Delta V_X = \alpha_X \beta_Y (R_Y - V_{AX}),$$

where $\alpha_A \beta_Y$ and $\alpha_X \beta_Y$ are the step-size parameters, which depend on the identities of the CS 
components and the US, and $R_Y$ is the asymptotic level of associative strength that the US Y can support. 
(Rescorla and Wagner used $\lambda$ here instead of $R$, but we use $R$ to avoid confusion with our use of $\lambda$ 
and because we usually think of this as the magnitude of a reward signal, with the caveat that the US 
in classical conditioning is not necessarily rewarding or penalizing.) A key assumption of the model is 
that the aggregate associative strength $V_{AX}$ is equal to $V_A + V_X$. The associative strengths as changed 
by these $\Delta$s become the associative strengths at the beginning of the next trial. 

To be complete, the model needs a response-generation mechanism, which is a way of mapping values 
of $V$s to CRs. Since this mapping would depend on details of the experimental situation, Rescorla and 
Wagner did not specify a mapping but simply assumed that larger $V$s would produce stronger or more 
likely CRs, and that negative $V$s would mean that there would be no CRs. 

The Rescorla-Wagner model accounts for the acquisition of CRs in a way that explains blocking. As 
long as the aggregate associative strength, $V_{AX}$, of the stimulus compound is below the asymptotic level 
of associative strength, $R_Y$, that the US Y can support, the prediction error $R_Y - V_{AX}$ is positive. This 
means that over successive trials the associative strengths $V_A$ and $V_X$ of the component stimuli increase 
until the aggregate associative strength $V_{AX}$ equals $R_Y$, at which point the associative strengths stop 
changing (unless the US changes). When a new component is added to a compound CS to which the 
animal has already been conditioned, further conditioning with the augmented compound produces little 
or no increase in the associative strength of the added CS component because the error has already been 
reduced to zero, or to a low value. The occurrence of the US is already predicted nearly perfectly, so 
little or no error (or surprise) is introduced by the new CS component. Prior learning blocks learning 
to the new component. 
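
The blocking account above can be checked numerically with a minimal sketch of the update rule. The function name, the use of one shared step size for both stimuli (that is, assuming the step-size products for A and X are equal), the value 0.2, and the trial counts are all illustrative assumptions rather than part of Rescorla and Wagner's specification.

```python
def rescorla_wagner(trials, step=0.2, r_y=1.0, v=None):
    """Apply the Rescorla-Wagner update over a sequence of trials.

    Each trial is a set of stimulus names presented together and followed
    by the US. `step` stands in for the step-size product alpha * beta
    (assumed equal for all stimuli) and `r_y` for the asymptote R_Y.
    """
    v = dict(v) if v else {}
    for present in trials:
        # Aggregate associative strength of the compound on this trial.
        v_agg = sum(v.get(s, 0.0) for s in present)
        error = r_y - v_agg
        # Every stimulus present is adjusted by the same shared error.
        for s in present:
            v[s] = v.get(s, 0.0) + step * error
    return v


# Blocking group: A alone is pretrained, then the compound AX is trained.
pretrained = rescorla_wagner([{"A"}] * 50)
blocked = rescorla_wagner([{"A", "X"}] * 50, v=pretrained)

# Control group: compound AX with no prior training of A.
control = rescorla_wagner([{"A", "X"}] * 50)

print("blocked X:", blocked["X"])
print("control X:", control["X"])
```

In the blocking group the error is already near zero when X is introduced, so X gains almost no strength; in the control group A and X split the available strength between them.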

To transition from Rescorla and Wagner’s model to the TD model of classical conditioning (which 
we just call the TD model), we first recast their model in terms of the concepts that we are using 
throughout this book. Specifically, we match the notation we use for learning with linear function 
approximation (Section 9.4), and we think of the conditioning process as one of learning to predict the 
“magnitude of the US” on a trial on the basis of the compound CS presented on that trial, where the 
magnitude of a US Y is the $R_Y$ of the Rescorla-Wagner model as given above. We also introduce states. 
Because the Rescorla-Wagner model is a trial-level model, meaning that it deals with how associative 
strengths change from trial to trial without considering any details about what happens within and 
between trials, we do not have to consider how states change during a trial until we present the full TD 
model in the following section. Instead, here we simply think of a state as a way of labeling a trial in 
terms of the collection of component CSs that are present on the trial. 

Therefore, assume that trial-type, or state, $s$ is described by a real-valued vector of features 
$\mathbf{x}(s) = (x_1(s), x_2(s), \ldots, x_d(s))^\top$ where $x_i(s) = 1$ if CS$_i$, the $i$th component of a compound CS, 
is present on the trial and 0 otherwise. Then if the $d$-dimensional vector of associative strengths is 
$\mathbf{w}$, the aggregate associative strength for trial-type $s$ is 

$$\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s). \tag{14.1}$$

This corresponds to a value estimate in reinforcement learning, and we think of it as the US prediction. 

Now temporarily let $t$ denote the number of a complete trial and not its usual meaning as a time step 
(we revert to $t$'s usual meaning when we extend this to the TD model below), and assume that $S_t$ is 
the state corresponding to trial $t$. Conditioning trial $t$ updates the associative strength vector $\mathbf{w}_t$ to 
$\mathbf{w}_{t+1}$ as follows: 

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \delta_t \mathbf{x}(S_t), \tag{14.2}$$

where $\alpha$ is the step-size parameter, and, because here we are describing the Rescorla-Wagner model, $\delta_t$ 
is the prediction error 

$$\delta_t = R_t - \hat{v}(S_t, \mathbf{w}_t). \tag{14.3}$$

$R_t$ is the target of the prediction on trial $t$, that is, the magnitude of the US, or in Rescorla and Wagner’s 
terms, the associative strength that the US on the trial can support. Note that because of the factor 
$\mathbf{x}(S_t)$ in (14.2), only the associative strengths of CS components present on a trial are adjusted as a 
result of that trial. You can think of the prediction error as a measure of surprise, and the aggregate 
associative strength as the animal’s expectation that is violated when it does not match the target US 
magnitude. 
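
The role of the feature-vector factor in (14.2) is easy to see in a small sketch of (14.1) through (14.3). The function names and the numbers used are assumptions chosen for illustration; plain Python lists stand in for the vectors.

```python
def us_prediction(w, x):
    """Aggregate associative strength: the inner product of w and x(s), as in (14.1)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def rw_update(w, x, r, alpha=0.1):
    """One trial of (14.2) and (14.3); returns the new weights and the error."""
    delta = r - us_prediction(w, x)  # prediction error (14.3)
    w_new = [wi + alpha * delta * xi for wi, xi in zip(w, x)]  # update (14.2)
    return w_new, delta


# Three possible CS components; only the first and third are present.
w0 = [0.0, 0.0, 0.0]
x = [1, 0, 1]
w1, delta0 = rw_update(w0, x, r=1.0)
w2, delta1 = rw_update(w1, x, r=1.0)

print(w1, delta0)
print(w2, delta1)
```

The weight of the absent second component never moves, because its feature is 0 on these trials, while the error shrinks as the prediction approaches the target.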

From the perspective of machine learning, the Rescorla-Wagner model is an error-correction supervised 
learning rule. It is essentially the same as the Least Mean Square (LMS), or Widrow-Hoff, learning 
rule (Widrow and Hoff, 1960) that finds the weights (here the associative strengths) that make the 
average of the squares of all the errors as close to zero as possible. It is a “curve-fitting,” or regression, 
algorithm that is widely used in engineering and scientific applications (see Section 9.4). 3 
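
One way to see the connection to LMS is to treat (14.2) as a stochastic gradient step on the squared prediction error of a single trial. The following short derivation uses only the definitions in (14.1) through (14.3); the per-trial error function $E_t$ and its factor of $\tfrac{1}{2}$ are notational conveniences introduced here, not part of the model as stated.

```latex
\begin{align*}
E_t(\mathbf{w}) &= \tfrac{1}{2}\bigl(R_t - \mathbf{w}^\top \mathbf{x}(S_t)\bigr)^2, \\
\nabla_{\mathbf{w}} E_t(\mathbf{w}_t)
  &= -\bigl(R_t - \mathbf{w}_t^\top \mathbf{x}(S_t)\bigr)\,\mathbf{x}(S_t)
   = -\delta_t\,\mathbf{x}(S_t), \\
\mathbf{w}_{t+1} &= \mathbf{w}_t - \alpha\,\nabla_{\mathbf{w}} E_t(\mathbf{w}_t)
   = \mathbf{w}_t + \alpha\,\delta_t\,\mathbf{x}(S_t).
\end{align*}
```

The last line is exactly (14.2), so each trial moves the associative strengths a small step downhill on that trial's squared error.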

The Rescorla-Wagner model was very influential in the history of animal learning theory because it 
showed that a “mechanistic” theory could account for the main facts about blocking without resorting 
to more complex cognitive theories involving, for example, an animal’s explicit recognition that another 
stimulus component had been added and then scanning its short-term memory backward to reassess 
the predictive relationships involving the US. The Rescorla-Wagner model showed how traditional 
contiguity theories of conditioning—that temporal contiguity of stimuli was a necessary and sufficient 
condition for learning—could be adjusted in a simple way to account for blocking (Moore and Schmajuk, 
2008). 

The Rescorla-Wagner model provides a simple account of blocking and some other features of classical 
conditioning, but it is not a complete or perfect model of classical conditioning. Different ideas account 
for a variety of other observed effects, and progress is still being made toward understanding the many 
subtleties of classical conditioning. The TD model, which we describe next, though also not a complete 
or perfect model of classical conditioning, extends the Rescorla-Wagner model to address how 
within-trial and between-trial timing relationships among stimuli can influence learning and how higher- 
order conditioning might arise. 


14.2.3 The TD Model 

The TD model is a real-time model, as opposed to a trial-level model like the Rescorla-Wagner model. 
A single step t in our formulation of Rescorla and Wagner’s model above represents an entire 
conditioning trial. The model does not apply to details about what happens during the time a trial 
is taking place, or what might happen between trials. Within each trial an animal might experience 
various stimuli whose onsets occur at particular times and that have particular durations. These 
timing relationships strongly influence learning. The Rescorla-Wagner model also does not include 
a mechanism for higher-order conditioning, whereas for the TD model, higher-order conditioning is a 
natural consequence of the bootstrapping idea that is at the base of TD algorithms. 

To describe the TD model we begin with the formulation of the Rescorla-Wagner model above, but 
t now labels time steps within or between trials instead of complete trials. Think of the time between 
t and t + 1 as a small time interval, say 0.01 second, and think of a trial as a sequence of states, 
one associated with each time step, where the state at step t now represents details of how stimuli 
are represented at t instead of just a label for the CS components present on a trial. In fact, we can 
completely abandon the idea of trials. From the point of view of the animal, a trial is just a fragment 
of its continuing experience interacting with its world. Following our usual view of an agent interacting 
with its environment, imagine that the animal is experiencing an endless sequence of states s, each 
represented by a feature vector $\mathbf{x}(s)$. That said, it is still often convenient to refer to trials as fragments 
of time during which patterns of stimuli repeat in an experiment. 

3 The only differences between the LMS rule and the Rescorla-Wagner model are that for LMS the input vectors $\mathbf{x}_t$ 
can have any real numbers as components, and, at least in the simplest version of the LMS rule, the step-size parameter 
$\alpha$ does not depend on the input vector or the identity of the stimulus setting the prediction target. 

State features are not restricted to describing the external stimuli that an animal experiences; they can 
describe neural activity patterns that external stimuli produce in an animal’s brain, and these patterns 
can be history-dependent, meaning that they can be persistent patterns produced by sequences of 
external stimuli. Of course, we do not know exactly what these neural activity patterns are, but a real-time 
model like the TD model allows one to explore the consequences on learning of different hypotheses 
about the internal representations of external stimuli. For these reasons, the TD model does not commit 
to any particular state representation. In addition, because the TD model includes discounting and 
eligibility traces that span time intervals between stimuli, the model also makes it possible to explore 
how discounting and eligibility traces interact with stimulus representations in making predictions about 
the results of classical conditioning experiments. 

Below we describe some of the state representations that have been used with the TD model and 
some of their implications, but for the moment we stay agnostic about the representation and just 
assume that each state $s$ is represented by a feature vector $\mathbf{x}(s) = (x_1(s), x_2(s), \ldots, x_n(s))^\top$. Then 
the aggregate associative strength corresponding to a state $s$ is given by (14.1), the same as for the 
Rescorla-Wagner model, but the TD model updates the associative strength vector, $\mathbf{w}$, differently. With 
$t$ now labeling a time step instead of a complete trial, the TD model governs learning according to this 
update: 

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \delta_t \mathbf{z}_t, \tag{14.4}$$

which replaces $\mathbf{x}(S_t)$ in the Rescorla-Wagner update (14.2) with $\mathbf{z}_t$, a vector of eligibility traces, and 
instead of the $\delta_t$ of (14.3), here $\delta_t$ is a TD error: 

$$\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t), \tag{14.5}$$

where $\gamma$ is a discount factor (between 0 and 1), $R_t$ is the prediction target at time $t$, and $\hat{v}(S_{t+1}, \mathbf{w}_t)$ 
and $\hat{v}(S_t, \mathbf{w}_t)$ are aggregate associative strengths at $t + 1$ and $t$ as defined by (14.1). 

Each component $i$ of the eligibility-trace vector $\mathbf{z}_t$ increments or decrements according to the 
component $x_i(S_t)$ of the feature vector $\mathbf{x}(S_t)$, and otherwise decays with a rate determined by $\gamma\lambda$: 

$$\mathbf{z}_{t+1} = \gamma \lambda \mathbf{z}_t + \mathbf{x}(S_t). \tag{14.6}$$

Here $\lambda$ is the usual eligibility trace decay parameter. 

Note that if $\gamma = 0$, the TD model reduces to the Rescorla-Wagner model with the exceptions that: 
the meaning of $t$ is different in each case (a trial number for the Rescorla-Wagner model and a time 
step for the TD model), and in the TD model there is a one-time-step lead in the prediction target $R$. 
The TD model is equivalent to the backward view of the semi-gradient TD($\lambda$) algorithm with linear 
function approximation (Chapter 12), except that $R_t$ in the model does not have to be a reward signal 
as it does when the TD algorithm is used to learn a value function for policy-improvement. 
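
A single time step of the TD model can be sketched directly from (14.4) through (14.6). The function and argument names and the default parameter values are assumptions for illustration. The sketch follows the indexing of the equations as written: the weights are updated using the current trace vector, and the trace is advanced afterwards.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def td_model_step(w, z, x_t, x_next, r_next, alpha=0.1, gamma=0.9, lam=0.8):
    """One time step of the TD model.

    w: associative strengths at step t; z: eligibility traces at step t;
    x_t, x_next: feature vectors of the states at t and t + 1;
    r_next: prediction target at t + 1.
    """
    # TD error (14.5): target plus discounted next prediction minus current one.
    delta = r_next + gamma * dot(w, x_next) - dot(w, x_t)
    # Weight update (14.4), driven by the eligibility traces.
    w_new = [wi + alpha * delta * zi for wi, zi in zip(w, z)]
    # Trace update (14.6): decay by gamma * lambda, then add current features.
    z_new = [gamma * lam * zi + xi for zi, xi in zip(z, x_t)]
    return w_new, z_new, delta


# One step in which the first feature is active and already has a trace.
w, z, delta = td_model_step(
    w=[0.0, 0.0], z=[1.0, 0.0], x_t=[1, 0], x_next=[0, 0], r_next=1.0
)
print(w, z, delta)
```

Setting `gamma=0.0` removes the bootstrapped term from the error, which leaves a target-minus-prediction error of the same form as (14.3).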


14.2.4 TD Model Simulations 

Real-time conditioning models like the TD model are interesting primarily because they make predictions 
for a wide range of situations that cannot be represented by trial-level models. These situations 
involve the timing and durations of conditionable stimuli, the timing of these stimuli in relation to the 
timing of the US, and the timing and shapes of CRs. For example, the US generally must begin after 
the onset of a neutral stimulus for conditioning to occur, with the rate and effectiveness of learning 
depending on the inter-stimulus interval, or ISI, the interval between the onsets of the CS and the US. 
When CRs appear, they generally begin before the appearance of the US and their temporal profiles 
change during learning. In conditioning with compound CSs, the component stimuli of the compound 
CSs may not all begin and end at the same time, sometimes forming what is called a serial compound 
in which the component stimuli occur in a sequence over time. Timing considerations like these make 
it important to consider how stimuli are represented, how these representations unfold over time during 
and between trials, and how they interact with discounting and eligibility traces. 

Figure 14.2 shows three of the stimulus representations that have been used in exploring the behavior 
of the TD model: the complete serial compound (CSC), the microstimulus (MS), and the presence 
representations (Ludvig, Sutton, and Kehoe, 2012). These representations differ in the degree to which 
they force generalization among nearby time points during which a stimulus is present. 

The simplest of the representations shown in Figure 14.2 is the presence representation in the figure’s 
right column. This representation has a single feature for each component CS present on a trial, 
where the feature has value 1 whenever that component is present, and 0 otherwise. 4 The presence 
representation is not a realistic hypothesis about how stimuli are represented in an animal’s brain, but as 
we describe below, the TD model with this representation can produce many of the timing phenomena 
seen in classical conditioning. 

4 In our formalism, there is a different state, $S_t$, for each time step $t$ during a trial, and for a trial in which a compound 
CS consists of $n$ component CSs of various durations occurring at various times throughout the trial, there is a feature, $x_i$, 
for each component CS$_i$, $i = 1, \ldots, n$, where $x_i(S_t) = 1$ for all times $t$ when the CS$_i$ is present, and equals zero otherwise. 

For the CSC representation (left column of Figure 14.2), the onset of each external stimulus initiates 
a sequence of precisely-timed short-duration internal signals that continues until the external stimulus 
ends. 5 This is like assuming the animal’s nervous system has a clock that keeps precise track of 
time during stimulus presentations; it is what engineers call a “tapped delay line.” Like the presence 
representation, the CSC representation is unrealistic as a hypothesis about how the brain internally 
represents stimuli, but Ludvig et al. (2012) call it a “useful fiction” because it can reveal details of 
how the TD model works when relatively unconstrained by the stimulus representation. The CSC 
representation is also used in most TD models of dopamine-producing neurons in the brain, a topic we 
take up in Chapter 15. The CSC representation is often viewed as an essential part of the TD model, 
although this view is mistaken. 

Figure 14.2: Three stimulus representations (in columns) sometimes used with the TD model. Each row 
represents one element of the stimulus representation. The three representations vary along a temporal 
generalization gradient, with no generalization between nearby time points in the complete serial compound (left 
column) and complete generalization between nearby time points in the presence representation (right column). 
The microstimulus representation occupies a middle ground. The degree of temporal generalization determines 
the temporal granularity with which US predictions are learned. Adapted with minor changes from Learning 
& Behavior, Evaluating the TD Model of Classical Conditioning, volume 40, 2012, p. 311, E. A. Ludvig, R. S. 
Sutton, E. J. Kehoe. With permission of Springer. 
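
The difference between the presence and CSC representations can be illustrated with a short sketch. The function names and the convention of describing each component CS by a half-open interval of integer time steps (onset included, offset excluded) are assumptions made here for illustration.

```python
def presence_features(cs_intervals, t):
    """One feature per component CS: 1 while that CS is present at step t."""
    return [1 if on <= t < off else 0 for on, off in cs_intervals]


def csc_features(cs_intervals, t):
    """One feature per (component CS, time step at which that CS is present).

    At most one feature per CS is active at any step, namely the one tied
    to the current step, so nearby time points share no active features.
    """
    feats = []
    for on, off in cs_intervals:
        feats.extend(1 if t == step else 0 for step in range(on, off))
    return feats


# A single CS present on steps 2, 3, and 4 (assumed timing).
one_cs = [(2, 5)]
for t in range(7):
    print(t, presence_features(one_cs, t), csc_features(one_cs, t))
```

With the presence representation the same single feature is active on steps 2 through 4, whereas the CSC representation activates a different feature on each of those steps.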

The MS representation (center column of Figure 14.2) is like the CSC representation in that each 
external stimulus initiates a cascade of internal stimuli, but in this case the internal stimuli (the 
microstimuli) are not of such limited and non-overlapping form; they are extended over time and 
overlap. As time elapses from stimulus onset, different sets of microstimuli become more or less active, 
and each subsequent microstimulus becomes progressively wider in time and reaches a lower maximal 
level. Of course, there are many MS representations depending on the nature of the microstimuli, and 
a number of examples of MS representations have been studied in the literature, in some cases along 
with proposals for how an animal’s brain might generate them (see the Bibliographic and Historical 
Comments at the end of this chapter). MS representations are more realistic than the presence or CSC 
representations as hypotheses about neural representations of stimuli, and they allow the behavior of 
the TD model to be related to a broader collection of phenomena observed in animal experiments. In 
particular, by assuming that cascades of microstimuli are initiated by USs as well as by CSs, and by 
studying the significant effects on learning of interactions between microstimuli, eligibility traces, and 
discounting, the TD model is helping to frame hypotheses to account for many of the subtle phenomena 
of classical conditioning and how an animal’s brain might produce them. We say more about this below, 
particularly in Chapter 15 where we discuss reinforcement learning and neuroscience. 
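
As a rough illustration of how a cascade of microstimuli could be generated, the sketch below multiplies an exponentially decaying memory trace by Gaussian basis functions centered at different trace heights, a construction in the spirit of the microstimulus literature. The exact functional form, the function name, and all parameter values (`m`, `decay`, `sigma`) are assumptions for illustration, not a specification taken from this chapter.

```python
import math


def microstimuli(t, m=4, decay=0.985, sigma=0.08):
    """Levels of m microstimuli t time steps after stimulus onset (t >= 0)."""
    y = decay ** t  # decaying memory trace of the stimulus, equal to 1 at onset
    centers = [(m - i) / m for i in range(m)]  # trace heights 1.0 down to 1/m
    return [
        y * math.exp(-((y - c) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi)
        for c in centers
    ]


steps = range(200)
levels = [microstimuli(t) for t in steps]

# For each microstimulus, find the step at which it peaks and its peak level.
peak_times = [max(steps, key=lambda t, i=i: levels[t][i]) for i in range(4)]
peak_heights = [levels[t][i] for i, t in enumerate(peak_times)]

print(peak_times)
print(peak_heights)
```

Under these assumptions each successive microstimulus peaks later and at a lower level, matching the qualitative description above.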

Even with the simple presence representation, however, the TD model produces all the basic 
properties of classical conditioning that are accounted for by the Rescorla-Wagner model, plus features of 
conditioning that are beyond the scope of trial-level models. For example, as we have already 
mentioned, a conspicuous feature of classical conditioning is that the US generally must begin after the 
onset of a neutral stimulus for conditioning to occur, and that after conditioning, the CR begins before 
the appearance of the US. In other words, conditioning generally requires a positive ISI, and the CR 
generally anticipates the US. How the strength of conditioning (e.g., the percentage of CRs elicited by 
a CS) depends on the ISI varies substantially across species and response systems, but it typically has 
the following properties: it is negligible for a zero or negative ISI, i.e., when the US onset occurs 
simultaneously with, or earlier than, the CS onset (although research has found that associative strengths 
sometimes increase slightly or become negative with negative ISIs); it increases to a maximum at a 
positive ISI where conditioning is most effective; and it then decreases to zero after an interval that 
varies widely with response systems. The precise shape of this dependency for the TD model depends 
on the values of its parameters and details of the stimulus representation, but these basic features of 
ISI-dependency are core properties of the TD model. 

One of the theoretical issues arising with serial-compound conditioning, that is, conditioning with a 
compound CS whose components occur in a sequence, concerns the facilitation of remote associations. 
It has been found that if the empty trace interval between the CS and the US is filled with a second CS 
to form a serial-compound stimulus, then conditioning to the first CS is facilitated. Figure 14.3 shows 
the behavior of the TD model with the presence representation in a simulation of such an experiment 
whose timing details are shown at the top of the figure. Consistent with the experimental results 
(Kehoe, 1982), the model shows facilitation of both the rate of conditioning and the asymptotic level 
of conditioning of the first CS due to the presence of the second CS. 

5 In our formalism, for each CS component CS$_i$ present on a trial, and for each time step $t$ during a trial, there is a 
separate feature $x_i^t$, where $x_i^t(S_{t'}) = 1$ if $t = t'$ for any $t'$ at which CS$_i$ is present, and equals 0 otherwise. This is different 
from the CSC representation in Sutton and Barto (1990) in which there are the same distinct features for each time step 
but no reference to external stimuli; hence the name complete serial compound. 


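[Figure 14.3 image: timing of CSA, CSB, and the US within a trial (top panel); associative strength of CSA over trials (bottom panel)] 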


Figure 14.3: Facilitation of a remote association by an intervening stimulus in the TD model. Top: temporal 
relationships among stimuli within a trial. Bottom: behavior over trials of CSA’s associative strength when 
CSA is presented in a serial compound as shown in the top panel, and when presented in an identical temporal 
relationship to the US, only without CSB. Adapted from Sutton and Barto (1990). 


A well-known demonstration of the effects on conditioning of temporal relationships among stimuli 
within a trial is an experiment by Egger and Miller (1962) that involved two overlapping CSs in a 
delay configuration as shown in the top panel of Figure 14.4. Although CSB was in a better temporal 
relationship with the US, the presence of CSA substantially reduced conditioning to CSB as compared 
to controls in which CSA was absent. The bottom panel of Figure 14.4 shows the same result being 
generated by the TD model in a simulation of this experiment with the presence representation. 

The TD model accounts for blocking because it is an error-correcting learning rule like the Rescorla- 
Wagner model. Beyond accounting for basic blocking results, however, the TD model predicts (with the 
presence representation and more complex representations as well) that blocking is reversed if the blocked 
stimulus is moved earlier in time so that its onset occurs before the onset of the blocking stimulus. This 
feature of the TD model’s behavior deserves attention because it had not been observed at the time of 
the model’s introduction. Recall that in blocking, if an animal has already learned that one CS predicts 
a US, then learning that a newly-added second CS also predicts the US is much reduced, i.e., is blocked. 
But if the newly-added second CS begins earlier than the pretrained CS, then, according to the TD 
model, learning to the newly-added CS is not blocked. In fact, as training continues and the newly-added 
CS gains associative strength, the pretrained CS loses associative strength. The behavior 
of the TD model under these conditions is shown in Figure 14.5. This simulation experiment differed 
from the Egger-Miller experiment of Figure 14.4 in that the shorter CS with the later onset was given 
prior training until it was fully associated with the US. This surprising prediction led Kehoe, Schreurs, 
and Graham (1987) to conduct the experiment using the well-studied rabbit nictitating membrane 
preparation. Their results confirmed the model’s prediction, and they noted that non-TD models have 
considerable difficulty explaining their data. 

With the TD model, an earlier predictive stimulus takes precedence over a later predictive stimulus 



Figure 14.4: The Egger-Miller, or primacy, effect in the TD model. Top: temporal relationships among stimuli 
within a trial. Bottom: behavior over trials of CSB’s associative strength when CSB is presented with and 
without CSA. Adapted from Sutton and Barto (1990). 


because, like all the prediction methods described in this book, the TD model is based on the backing- 
up or bootstrapping idea: updates to associative strengths shift the strengths at a particular state 
toward the strength at later states. Another consequence of bootstrapping is that the TD model 
provides an account of higher-order conditioning, a feature of classical conditioning that is beyond the 
scope of the Rescorla-Wagner and similar models. As we described above, higher-order conditioning 
is the phenomenon in which a previously-conditioned CS can act as a US in conditioning another 
initially neutral stimulus. Figure 14.6 shows the behavior of the TD model (again with the presence 
representation) in a higher-order conditioning experiment—in this case it is second-order conditioning. 
In the first phase (not shown in the figure), CSB is trained to predict a US so that its associative strength 
increases, here to 1.65. In the second phase, CSA is paired with CSB in the absence of the US, in the 
sequential arrangement shown at the top of the figure. CSA acquires associative strength even though 
it is never paired with the US. With continued training, CSA’s associative strength reaches a peak and 
then decreases because the associative strength of CSB, the secondary reinforcer, decreases so that it 
loses its ability to provide secondary reinforcement. CSB’s associative strength decreases because the 
US does not occur in these higher-order conditioning trials. These are extinction trials for CSB because 
its predictive relationship to the US is disrupted so that its ability to act as a reinforcer decreases. This 
same pattern is seen in animal experiments. This extinction of conditioned reinforcement in higher- 
order conditioning trials makes it difficult to demonstrate higher-order conditioning unless the original 
predictive relationships are periodically refreshed by occasionally inserting first-order trials. 

The TD model produces an analog of second- and higher-order conditioning because $\gamma \hat{v}(S_{t+1},\mathbf{w}_t) - \hat{v}(S_t,\mathbf{w}_t)$ 
appears in the TD error $\delta_t$ (14.5). This means that as a result of previous learning, $\gamma \hat{v}(S_{t+1},\mathbf{w}_t)$ 
can differ from $\hat{v}(S_t,\mathbf{w}_t)$, making $\delta_t$ non-zero (a temporal difference). This difference has the same status 
as $R_{t+1}$ in (14.5), implying that as far as learning is concerned there is no difference between a temporal 
difference and the occurrence of a US. In fact, this feature of the TD algorithm is one of the major 
reasons for its development, which we now understand through its connection to dynamic programming 
as described in Chapter 6. Bootstrapping values is intimately related to second-order, and higher-order, 
conditioning. 
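
The role of the temporal difference as a stand-in for the US can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not the simulation behind Figure 14.6: it uses TD(0) (that is, no eligibility traces), the presence representation with one weight per stimulus, and a hypothetical trial layout in which CSA is on for four steps and CSB for the next four. The step size and discount rate are chosen for illustration, and the function name `second_order_trials` is our own.

```python
def second_order_trials(n_trials=100, alpha=0.05, gamma=0.97, w_b_init=1.65):
    """Simulate CSA -> CSB trials with no US, using TD(0) and presence features.

    Assumed (hypothetical) trial layout: CSA on for steps 0-3, CSB on for
    steps 4-7, then one stimulus-free step. Returns (w_A, w_B) after each trial.
    """
    # Each step is a feature vector (CSA present, CSB present).
    steps = [(1, 0)] * 4 + [(0, 1)] * 4 + [(0, 0)]
    w = [0.0, w_b_init]  # associative strengths of CSA and CSB
    history = []
    for _ in range(n_trials):
        for x, x_next in zip(steps[:-1], steps[1:]):
            v = w[0] * x[0] + w[1] * x[1]
            v_next = w[0] * x_next[0] + w[1] * x_next[1]
            reward = 0.0  # the US never occurs in these trials
            # TD error: with reward == 0 it is purely a temporal difference.
            delta = reward + gamma * v_next - v
            w[0] += alpha * delta * x[0]
            w[1] += alpha * delta * x[1]
        history.append((w[0], w[1]))
    return history
```

Under these assumptions CSA's strength rises even though the reward is always zero, because the transition from CSA to CSB produces a positive temporal difference. It then falls again as CSB's strength extinguishes, which is the qualitative pattern described above.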

In the examples of the TD model’s behavior described above, we examined only the changes in 
Figure 14.5: Temporal primacy overriding blocking in the TD model. Top: temporal relationships between 
stimuli. Bottom: behavior over trials of CSB’s associative strength when CSB is presented with and without 
CSA. The only difference between this simulation and that shown in Figure 14.4 was that here CSB started 
out fully conditioned—CSB’s associative strength was initially set to 1.653, the final level reached when CSB 
was presented alone for 80 trials, as in the “CSA-absent” case in Figure 14.4. Adapted from Sutton and Barto 
(1990). 


the associative strengths of the CS components; we did not look at what the model predicts about 
properties of an animal’s conditioned responses (CRs): their timing, shape, and how they develop 
over conditioning trials. These properties depend on the species, the response system being observed, 
and parameters of the conditioning trials, but in many experiments with different animals and different 
response systems, the magnitude of the CR, or the probability of a CR, increases as the expected time of 
the US approaches. For example, in the classical conditioning of a rabbit’s nictitating membrane response 
that we mentioned above, the delay from CS onset to when the nictitating membrane begins to move 
across the eye decreases over conditioning trials, and the amplitude of this anticipatory 
closure gradually increases over the interval between the CS and the US until the membrane reaches 
maximal closure at the expected time of the US. The timing and shape of this CR are critical to its 
adaptive significance: covering the eye too early reduces vision (even though the nictitating membrane 
is translucent), while covering it too late is of little protective value. Capturing CR features like these 
is challenging for models of classical conditioning. 

The TD model does not include as part of its definition any mechanism for translating the time 
course of the US prediction, $\hat{v}(S_t,\mathbf{w}_t)$, into a profile that can be compared with the properties of an 
animal’s CR. The simplest choice is to let the time course of a simulated CR equal the time course of 
the US prediction. In this case, features of simulated CRs and how they change over trials depend only 
on the stimulus representation chosen and the values of the model’s parameters $\alpha$, $\gamma$, and $\lambda$. 

Figure 14.7 shows the time courses of US predictions at different points during learning with the 
three representations shown in Figure 14.2. For these simulations the US occurred 25 time steps after 
the onset of the CS, and $\alpha = .05$, $\lambda = .95$, and $\gamma = .97$. With the CSC representation (Figure 14.7 left), 
the curve of the US prediction formed by the TD model increases exponentially throughout the interval 
between the CS and the US until it reaches a maximum exactly when the US occurs (at time step 25). 
This exponential increase is the result of discounting in the TD model learning rule. With the presence 
representation (Figure 14.7 middle), the US prediction is nearly constant while the stimulus is present 
Figure 14.6: Second-order conditioning with the TD model. Top: temporal relationships between stimuli. 
Bottom: behavior of the associative strengths associated with CSA and CSB over trials. The second stimulus, 
CSB, has an initial associative strength of 1.653 at the beginning of the simulation. Adapted from Sutton and 
Barto (1990). 


because there is only one weight, or associative strength, to be learned for each stimulus. Consequently, 
the TD model with the presence representation cannot recreate many features of CR timing. With 
an MS representation (Figure 14.7 right), the development of the TD model’s US prediction is more 
complicated. After 200 trials the prediction’s profile is a reasonable approximation of the US prediction 
curve produced with the CSC representation. 
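
The exponentially increasing profile obtained with the CSC representation can be reproduced with a few lines of code. The sketch below is our own illustration rather than the code used for Figure 14.7. It assumes one feature per time step since CS onset, accumulating eligibility traces that are reset at the start of each trial, a US of intensity 1 delivered on the transition out of the last CS step, and a prediction of zero once the US has occurred. The function name `csc_us_prediction` is hypothetical.

```python
def csc_us_prediction(n_trials=200, isi=25, alpha=0.05, gamma=0.97, lam=0.95):
    """Return the learned US prediction for each time step since CS onset."""
    w = [0.0] * isi  # one weight per time step (one CSC feature each)
    for _ in range(n_trials):
        z = [0.0] * isi  # eligibility traces, reset at the start of each trial
        for t in range(isi):
            z = [gamma * lam * z_i for z_i in z]
            z[t] += 1.0  # only feature t is active at step t
            # Assumption: the US (intensity 1) arrives on the step after t = isi - 1.
            us = 1.0 if t == isi - 1 else 0.0
            # Assumption: no features are active once the US has occurred.
            v_next = w[t + 1] if t + 1 < isi else 0.0
            delta = us + gamma * v_next - w[t]  # TD error
            w = [w_i + alpha * delta * z_i for w_i, z_i in zip(w, z)]
    return w
```

In this setting the prediction to which learning converges at step t is $\gamma^{\,isi-1-t}$, so the curve grows by a factor of $1/\gamma$ per step and approaches the US intensity just before the US. This is why discounting produces the exponential shape.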

The US prediction curves shown in Figure 14.7 were not intended to precisely match profiles of CRs 
as they develop during conditioning in any particular animal experiment, but they illustrate the strong 
influence that the stimulus representation has on predictions derived from the TD model. Further, 
although we can only mention it here, how the stimulus representation interacts with discounting and 
eligibility traces is important in determining properties of the US prediction profiles produced by the 
TD model. Another dimension beyond what we can discuss here is the influence of different response- 
generation mechanisms that translate US predictions into CR profiles; the profiles shown in Figure 14.7 
are “raw” US prediction profiles. Even without any special assumption about how an animal’s brain 
might produce overt responses from US predictions, however, the profiles in Figure 14.7 for the CSC 
and MS representations increase as the time of the US approaches and reach a maximum at the time 
of the US, as is seen in many animal conditioning experiments. 

The TD model, when combined with particular stimulus representations and response-generation 
mechanisms, is able to account for a surprisingly wide range of phenomena observed in animal classical 
conditioning experiments, but it is far from being a perfect model. To generate other details of classical 
conditioning the model needs to be extended, perhaps by adding model-based elements and mechanisms 
for adaptively altering some of its parameters. Other approaches to modeling classical conditioning 
depart significantly from the Rescorla-Wagner-style error-correction process. Bayesian models, for 
example, work within a probabilistic framework in which experience revises probability estimates. All 
of these models usefully contribute to our understanding of classical conditioning. 

Perhaps the most notable feature of the TD model is that it is based on a theory (the theory we 
have described in this book) that suggests an account of what an animal’s nervous system is trying to 
do while undergoing conditioning: it is trying to form accurate long-term predictions, consistent with 
the limitations imposed by the way stimuli are represented and how the nervous system works. In 
other words, it suggests a normative account of classical conditioning in which long-term, instead of 
Figure 14.7: Time course of US prediction over the course of acquisition for the TD model with three different 
stimulus representations. Left: With the complete serial compound (CSC), the US prediction increases 
exponentially through the interval, peaking at the time of the US. At asymptote (trial 200), the US prediction 
peaks at the US intensity (1 in these simulations). Middle: With the presence representation, the US prediction 
converges to an almost constant level. This constant level is determined by the US intensity and the length of 
the CS-US interval. Right: With the microstimulus representation, at asymptote, the TD model approximates 
the exponentially increasing time course depicted with the CSC through a linear combination of the different 
microstimuli. Adapted with minor changes from Learning & Behavior, Evaluating the TD Model of Classical 
Conditioning, volume 40, 2012, E. A. Ludvig, R. S. Sutton, E. J. Kehoe. With permission of Springer. 


immediate, prediction is a key feature. 

The development of the TD model of classical conditioning is one instance in which the explicit 
goal was to model some of the details of animal learning behavior. In addition to its standing as an 
algorithm, then, TD learning is also the basis of this model of aspects of biological learning. As we 
discuss in Chapter 15, TD learning has also turned out to underlie an influential model of the activity of 
neurons that produce dopamine, a chemical in the brain of mammals that is deeply involved in reward 
processing. These are instances in which reinforcement learning theory makes detailed contact with 
animal behavioral and neural data. 

We now turn to considering correspondences between reinforcement learning and animal behavior 
in instrumental conditioning experiments, the other major type of laboratory experiment studied by 
animal learning psychologists. 


14.3 Instrumental Conditioning 

In instrumental conditioning experiments learning depends on the consequences of behavior: the delivery 
of a reinforcing stimulus is contingent on what the animal does. In classical conditioning experiments, 
in contrast, the reinforcing stimulus (the US) is delivered independently of the animal’s 
behavior. Instrumental conditioning is usually considered to be the same as operant conditioning, the 
term B. F. Skinner (1938, 1961) introduced for experiments with behavior-contingent reinforcement, 
though the experiments and theories of those who use these two terms differ in a number of ways, some 
of which we touch on below. We will exclusively use the term instrumental conditioning for experiments 
in which reinforcement is contingent upon behavior. The roots of instrumental conditioning go back 
to experiments performed by the American psychologist Edward Thorndike one hundred years before 
publication of the first edition of this book. 

Thorndike observed the behavior of cats when they were placed in “puzzle boxes” from which they 
could escape by appropriate actions (Figure 14.8). For example, a cat could open the door of one 
box by performing a sequence of three separate actions: depressing a platform at the back of the box, 
pulling a string by clawing at it, and pushing a bar up or down. When first placed in a puzzle box, 
Figure 14.8: One of Thorndike’s puzzle boxes. Reprinted from Thorndike, Animal Intelligence: An Experimental 
Study of the Associative Processes in Animals, The Psychological Review, Series of Monograph Supplements, 
11(4), Macmillan, New York, 1898. 


with food visible outside, all but a few of Thorndike’s cats displayed “evident signs of discomfort” and 
extraordinarily vigorous activity “to strive instinctively to escape from confinement” (Thorndike, 1898). 

In experiments with different cats and boxes with different escape mechanisms, Thorndike recorded 
the amounts of time each cat took to escape over multiple experiences in each box. He observed that 
the time almost invariably decreased with successive experiences, for example, from 300 seconds to 6 
or 7 seconds. He described cats’ behavior in a puzzle box like this: 

The cat that is clawing all over the box in her impulsive struggle will probably claw the 
string or loop or button so as to open the door. And gradually all the other non-successful 
impulses will be stamped out and the particular impulse leading to the successful act will 
be stamped in by the resulting pleasure, until, after many trials, the cat will, when put in 
the box, immediately claw the button or loop in a definite way. (Thorndike 1898, p. 13) 

These and other experiments (some with dogs, chicks, monkeys, and even fish) led Thorndike to formulate 
a number of “laws” of learning, the most influential being the Law of Effect, a version of which 
we quoted in Chapter 1. This law describes what is generally known as learning by trial and error. 
As mentioned in Chapter 1, many aspects of the Law of Effect have generated controversy, and its 
details have been modified over the years. Still, the law, in one form or another, expresses an enduring 
principle of learning. 

Essential features of reinforcement learning algorithms correspond to features of animal learning 
described by the Law of Effect. First, reinforcement learning algorithms are selectional, meaning that 
they try alternatives and select among them by comparing their consequences. Second, reinforcement 
learning algorithms are associative, meaning that the alternatives found by selection are associated with 
particular situations, or states, to form the agent’s policy. Like learning described by the Law of Effect, 
reinforcement learning is not just the process of finding actions that produce a lot of reward, but also of 
connecting these actions to situations or states. Thorndike used the phrase learning by “selecting and 
connecting” (Hilgard, 1956). Natural selection in evolution is a prime example of a selectional process, 
but it is not associative (at least as it is commonly understood); supervised learning is associative, 
but it is not selectional because it relies on instructions that directly tell the agent how to change its 
behavior. 

In computational terms, the Law of Effect describes an elementary way of combining search and 
memory: search in the form of trying and selecting among many actions in each situation, and memory 
in the form of associations linking situations with the actions found—so far—to work best in those 
situations. Search and memory are essential components of all reinforcement learning algorithms, 
whether memory takes the form of an agent’s policy, value function, or environment model. 
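
The combination of search and memory can be made concrete with a deliberately small sketch. It is an illustration only, not an algorithm from this book: the situation names, action names, and the `puzzle_box` consequence function are hypothetical, and search is done by exhaustively trying each action so that the example is deterministic.

```python
def learn_by_trial_and_error(situations, actions, consequence, n_sweeps=5, step=0.5):
    """Stamp situation-action connections in or out according to their consequences."""
    strength = {(s, a): 0.0 for s in situations for a in actions}
    for _ in range(n_sweeps):
        for s in situations:
            for a in actions:  # search: try each alternative in each situation
                strength[(s, a)] += step * (consequence(s, a) - strength[(s, a)])
    # memory: associate each situation with the action found to work best there
    return {s: max(actions, key=lambda a: strength[(s, a)]) for s in situations}


def puzzle_box(situation, action):
    """Hypothetical consequences: each box opens for a different action."""
    opens = {"box_with_loop": "claw_loop", "box_with_bar": "push_bar"}
    return 1.0 if opens[situation] == action else 0.0


policy = learn_by_trial_and_error(
    ["box_with_loop", "box_with_bar"], ["claw_loop", "push_bar", "bite"], puzzle_box
)
```

The learner is selectional because it compares the consequences of alternatives, and associative because the result is a mapping from situations to actions rather than a single best action.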

A reinforcement learning algorithm’s need to search means that it has to explore in some way. Animals 
clearly explore as well, and early animal learning researchers disagreed about the degree of guidance an 
animal uses in selecting its actions in situations like Thorndike’s puzzle boxes. Are actions the result 
of “absolutely random, blind groping” (Woodworth, 1938, p. 777), or is there some degree of guidance, 
either from prior learning, reasoning, or other means? Although some thinkers, including Thorndike, 
seem to have taken the former position, others favored more deliberate exploration. Reinforcement 
learning algorithms allow wide latitude for how much guidance an agent can employ in selecting actions. 
The forms of exploration we have used in the algorithms presented in this book, such as ε-greedy and 
upper-confidence-bound action selection, are merely among the simplest. More sophisticated methods 
are possible, with the only stipulation being that there has to be some form of exploration for the 
algorithms to work effectively. 
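
For reference, the two simple exploration rules just named can be sketched as follows. This is a minimal illustration; the function signatures, the tie-breaking by lowest index, and the default value of the exploration constant `c` are our assumptions.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Ties are broken in favor of the lowest index.
    return max(range(len(q_values)), key=lambda a: q_values[a])


def ucb(q_values, counts, t, c=2.0):
    """Pick the action maximizing its estimate plus an uncertainty bonus."""
    for a, n in enumerate(counts):
        if n == 0:
            return a  # an untried action is treated as maximally uncertain
    return max(
        range(len(q_values)),
        key=lambda a: q_values[a] + c * math.sqrt(math.log(t) / counts[a]),
    )
```

With `epsilon` set to zero the first rule is purely greedy and never explores. The second rule needs no randomness: it explores by favoring actions that have been tried rarely relative to the elapsed time `t`.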

The feature of our treatment of reinforcement learning allowing the set of actions available at any 
time to depend on the environment’s current state echoes something Thorndike observed in his cats’ 
puzzle-box behaviors. The cats selected actions from those that they instinctively perform in their 
current situation, which Thorndike called their “instinctual impulses.” First placed in a puzzle box, a 
cat instinctively scratches, claws, and bites with great energy: a cat’s instinctual responses to finding 
itself in a confined space. Successful actions are selected from these and not from every possible action 
or activity. This is like the feature of our formalism where the action selected from a state s belongs to 
a set of admissible actions, A(s). Specifying these sets is an important aspect of reinforcement learning 
because it can radically simplify learning. They are like an animal’s instinctual impulses. On the other 
hand, Thorndike’s cats might have been exploring according to an instinctual context-specific ordering 
over actions rather than by just selecting from a set of instinctual impulses. This is another way to 
make reinforcement learning easier. 
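
Restricting selection to a state-dependent set of admissible actions is simple to express in code. In the sketch below the dictionary `admissible` plays the role of A(s). The state names, action names, and action values are invented for illustration.

```python
def greedy_admissible(q, state, admissible):
    """Return the highest-valued action among those admissible in `state`."""
    candidates = admissible[state]  # plays the role of A(s)
    return max(candidates, key=lambda a: q.get((state, a), 0.0))


# Hypothetical example: a confined cat can scratch, claw, or bite, but cannot eat.
admissible = {"in_box": ["scratch", "claw", "bite"], "outside": ["eat", "walk"]}
q = {("in_box", "claw"): 0.8, ("in_box", "scratch"): 0.2, ("in_box", "eat"): 5.0}
choice = greedy_admissible(q, "in_box", admissible)
```

Here `choice` is "claw": the entry for eating while in the box is never considered, however large its value, because it is not in the admissible set for that state.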

Among the most prominent animal learning researchers influenced by the Law of Effect were Clark 
Hull (e.g., Hull, 1943) and B. F. Skinner (e.g., Skinner, 1938). At the center of their research was 
the idea of selecting behavior on the basis of its consequences. Reinforcement learning has features 
in common with Hull’s theory, which included eligibility-like mechanisms and secondary reinforcement 
to account for the ability to learn when there is a significant time interval between an action and the 
consequent reinforcing stimulus (see Section 14.4). Randomness also played a role in Hull’s theory 
through what he called “behavioral oscillation” to introduce exploratory behavior. 

Skinner did not fully subscribe to the memory aspect of the Law of Effect. Being averse to the 
idea of associative linkages, he instead emphasized selection from spontaneously-emitted behavior. He 
introduced the term “operant” to emphasize the key role of an action’s effects on an animal’s environment. 
Unlike the experiments of Thorndike and others, which consisted of sequences of separate trials, 
Skinner’s operant conditioning experiments allowed animal subjects to behave for extended periods of 
time without interruption. He invented the operant conditioning chamber, now called a “Skinner box,” 
the most basic version of which contains a lever or key that an animal can press to obtain a reward, 
such as food or water, which would be delivered according to a well-defined rule, called a reinforcement 
schedule. By recording the cumulative number of lever presses as a function of time, Skinner and his 
followers could investigate the effect of different reinforcement schedules on the animal’s rate of lever-pressing. 
Modeling results from experiments like these using the reinforcement learning principles we 
present in this book is not well developed, but we mention some exceptions in the Bibliographical and 
Historical Remarks section at the end of this chapter. 

Another of Skinner’s contributions resulted from his recognition of the effectiveness of training an 
animal by reinforcing successive approximations of the desired behavior, a process he called shaping. Although 
this technique had been used by others, including Skinner himself, its significance was impressed 
upon him when he and colleagues were attempting to train a pigeon to bowl by swiping a wooden ball 
with its beak. After waiting for a long time without seeing any swipe that they could reinforce, they 

... decided to reinforce any response that had the slightest resemblance to a swipe (perhaps, 
at first, merely the behavior of looking at the ball) and then to select responses which more 
closely approximated the final form. The result amazed us. In a few minutes, the ball 
was caroming off the walls of the box as if the pigeon had been a champion squash player. 
(Skinner, 1958, p. 94) 

Not only did the pigeon learn a behavior that is unusual for pigeons, it learned quickly through an 
interactive process in which its behavior and the reinforcement contingencies changed in response to each 
other. Skinner compared the process of altering reinforcement contingencies to the work of a sculptor 
shaping clay into a desired form. Shaping is a powerful technique for computational reinforcement 
learning systems as well. When it is difficult for an agent to receive any non-zero reward signal at all, 
either due to sparseness of rewarding situations or their inaccessibility given initial behavior, starting 
with an easier problem and incrementally increasing its difficulty as the agent learns can be an effective, 
and sometimes indispensable, strategy. 

A concept from psychology that is especially relevant in the context of instrumental conditioning is 
motivation, which refers to processes that influence the direction and strength, or vigor, of behavior. 
Thorndike’s cats, for example, were motivated to escape from puzzle boxes because they wanted the 
food that was sitting just outside. Obtaining this goal was rewarding to them and reinforced the actions 
allowing them to escape. It is difficult to link the concept of motivation, which has many dimensions, 
in a precise way to reinforcement learning’s computational perspective, but there are clear links with 
some of its dimensions. 

In one sense, a reinforcement learning agent’s reward signal is at the base of its motivation: the agent 
is motivated to maximize the total reward it receives over the long run. A key facet of motivation, 
then, is what makes an agent’s experience rewarding. In reinforcement learning, reward signals depend 
on the state of the reinforcement learning agent’s environment and the agent’s actions. Further, as 
pointed out in Chapter 1, the state of the agent’s environment not only includes information about 
what is external to the machine, like an organism or a robot, that houses the agent, but also what 
is internal to this machine. Some internal state components correspond to what psychologists call an 
animal’s motivational state, which influences what is rewarding to the animal. For example, an animal 
will be more rewarded by eating when it is hungry than when it has just finished a satisfying meal. The 
concept of state dependence is broad enough to allow for many types of modulating influences on the 
generation of reward signals. 

Value functions provide a further link to psychologists’ concept of motivation. If the most basic 
motive for selecting an action is to obtain as much reward as possible, for a reinforcement learning 
agent that selects actions using a value function, a more proximal motive is to ascend the gradient of 
its value function, that is, to select actions expected to lead to the most highly-valued next states (or 
what is essentially the same thing, to select actions with the greatest action-values). For these agents, 
value functions are the main driving force determining the direction of their behavior. 

Another dimension of motivation is that an animal’s motivational state not only influences learning, 
but also influences the strength, or vigor, of the animal’s behavior after learning. For example, after 
learning to find food in the goal box of a maze, a hungry rat will run faster to the goal box than one 
that is not hungry. This aspect of motivation does not link so cleanly to the reinforcement learning 
framework we present here, but in the Bibliographical and Historical Remarks section at the end of this 
chapter we cite several publications that propose theories of behavioral vigor based on reinforcement 
learning. 

We turn now to the subject of learning when reinforcing stimuli occur well after the events they 
reinforce. The mechanisms used by reinforcement learning algorithms to enable learning with delayed 
reinforcement (eligibility traces and TD learning) closely correspond to psychologists’ hypotheses 
about how animals can learn under these conditions. 


14.4 Delayed Reinforcement 

The Law of Effect requires a backward effect on connections, and some early critics of the law could not 
conceive of how the present could affect something that was in the past. This concern was amplified 
by the fact that learning can even occur when there is a considerable delay between an action and the 
consequent reward or penalty. Similarly, in classical conditioning, learning can occur when US onset 
follows CS offset by a non-negligible time interval. We call this the problem of delayed reinforcement, 
which is related to what Minsky (1961) called the “credit-assignment problem for learning systems”: how 
do you distribute credit for success among the many decisions that may have been involved in producing 
it? The reinforcement learning algorithms presented in this book include two basic mechanisms for 
addressing this problem. The first is the use of eligibility traces, and the second is the use of TD 
methods to learn value functions that provide nearly immediate evaluations of actions (in tasks like 
instrumental conditioning experiments) or that provide immediate prediction targets (in tasks like 
classical conditioning experiments). Both of these methods correspond to similar mechanisms proposed 
in theories of animal learning. 

Pavlov (1927) pointed out that every stimulus must leave a trace in the nervous system that persists 
for some time after the stimulus ends, and he proposed that stimulus traces make learning possible 
when there is a temporal gap between the CS offset and the US onset. To this day, conditioning under 
these conditions is called trace conditioning (Figure 14.1). Assuming a trace of the CS remains when 
the US arrives, learning occurs through the simultaneous presence of the trace and the US. We discuss 
some proposals for trace mechanisms in the nervous system in Chapter 15. 

Stimulus traces were also proposed as a means for bridging the time interval between actions and 
consequent rewards or penalties in instrumental conditioning. In Hull’s influential learning theory, for 
example, “molar stimulus traces” accounted for what he called an animal’s goal gradient, a description 
of how the maximum strength of an instrumentally-conditioned response decreases with increasing delay 
of reinforcement (Hull, 1932, 1943). Hull hypothesized that an animal’s actions leave internal stimuli 
whose traces decay exponentially as functions of time since an action was taken. Looking at the animal 
learning data available at the time, he hypothesized that the traces effectively reach zero after 30 to 40 
seconds. 

The eligibility traces used in the algorithms described in this book are like Hull’s traces: they are 
decaying traces of past state visitations, or of past state-action pairs. Eligibility traces were introduced 
by Klopf (1972) in his neuronal theory in which they are temporally-extended traces of past activity at 
synapses, the connections between neurons. Klopf’s traces are more complex than the exponentially- 
decaying traces our algorithms use, and we discuss this more when we take up his theory in Section 15.9. 
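
An exponentially decaying trace of past state visitations can be sketched in a few lines. The function below is a minimal illustration with a hypothetical name and a dictionary-based representation. It decays per time step by the factor $\gamma\lambda$, whereas Hull's traces were described as functions of elapsed time.

```python
def update_traces(z, visited_state, gamma, lam):
    """Decay every trace by gamma * lam, then bump the trace of the visited state."""
    z = {s: gamma * lam * value for s, value in z.items()}
    if visited_state is not None:
        # Accumulating trace: each visit adds 1 to whatever trace remains.
        z[visited_state] = z.get(visited_state, 0.0) + 1.0
    return z


# One visit to state "s", followed by three steps without revisiting it.
z = update_traces({}, "s", gamma=0.97, lam=0.95)
for _ in range(3):
    z = update_traces(z, None, gamma=0.97, lam=0.95)
```

After k steps without a revisit the trace equals $(\gamma\lambda)^k$, so a reinforcing event arriving later still credits the earlier state, but with geometrically shrinking weight.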

To account for goal gradients that extend over longer time periods than spanned by stimulus traces, 
Hull (1943) proposed that longer gradients result from conditioned reinforcement passing backwards 
from the goal, a process acting in conjunction with his molar stimulus traces. Animal experiments 
showed that if conditions favor the development of conditioned reinforcement during a delay period, 
learning does not decrease with increased delay as much as it does under conditions that obstruct 
secondary reinforcement. Conditioned reinforcement is favored if there are stimuli that regularly occur 
during the delay interval. Then it is as if reward is not actually delayed because there is more immediate 
conditioned reinforcement. Hull therefore envisioned that there is a primary gradient based on the delay 
of the primary reinforcement mediated by stimulus traces, and that this is progressively modified, and 
lengthened, by conditioned reinforcement. 

Algorithms presented in this book that use both eligibility traces and value functions to enable 
learning with delayed reinforcement correspond to Hull’s hypothesis about how animals are able to 
learn under these conditions. The actor-critic architecture discussed in Sections 13.5, 15.7, and 15.8 
illustrates this correspondence most clearly. The critic uses a TD algorithm to learn a value function 
associated with the system’s current behavior, that is, to predict the current policy’s return. The 
actor updates the current policy based on the critic’s predictions, or more exactly, on changes in the 
critic’s predictions. The TD error produced by the critic acts as a conditioned reinforcement signal 
for the actor, providing an immediate evaluation of performance even when the primary reward signal 
itself is considerably delayed. Algorithms that estimate action-value functions, such as Q-learning and 
Sarsa, similarly use TD learning principles to enable learning with delayed reinforcement by means 
of conditioned reinforcement. The close parallel between TD learning and the activity of dopamine-producing 
neurons that we discuss in Chapter 15 lends additional support to links between reinforcement 
learning algorithms and this aspect of Hull’s learning theory. 
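The role of the TD error as a conditioned reinforcement signal can be sketched with a small actor-critic example. This is only an illustration under stated assumptions: the two-step task, the softmax actor, and the step sizes are hypothetical and are not the specific algorithms of the sections cited above. The primary reward arrives one step after the choice, yet the actor is updated at the moment of choice using the change in the critic's prediction.

```python
import math
import random

random.seed(0)

# Hypothetical task: in state 0, action 0 leads to state 1 and action 1 to state 2.
# Primary reward arrives only on the following step: 1 after state 1, 0 after state 2.
NEXT_STATE = {0: 1, 1: 2}
FINAL_REWARD = {1: 1.0, 2: 0.0}

V = [0.0, 0.0, 0.0]      # critic: state-value estimates
prefs = [0.0, 0.0]       # actor: action preferences in state 0
alpha_v, alpha_p, gamma = 0.1, 0.1, 1.0

def policy():
    """Softmax over the actor's preferences."""
    exps = [math.exp(p) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

for _ in range(2000):
    probs = policy()
    a = 0 if random.random() < probs[0] else 1
    s_next = NEXT_STATE[a]
    # First transition: the primary reward is zero, so the TD error comes
    # entirely from the change in the critic's prediction.
    delta = 0.0 + gamma * V[s_next] - V[0]
    V[0] += alpha_v * delta
    # Actor update driven by the TD error (gradient of the log softmax policy).
    for b in range(2):
        grad = (1.0 if b == a else 0.0) - probs[b]
        prefs[b] += alpha_p * delta * grad
    # Second transition: the delayed primary reward finally arrives.
    V[s_next] += alpha_v * (FINAL_REWARD[s_next] - V[s_next])

print("action preferences:", prefs)
print("critic values:", V)
```

Once the critic has learned that state 1 predicts reward, entering state 1 produces a positive TD error that immediately strengthens the action leading there, even though no primary reward has yet been received.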


14.5 Cognitive Maps 

Model-based reinforcement learning algorithms use environment models that have elements in common 
with what psychologists call cognitive maps. Recall from our discussion of planning and learning in 
Chapter 8 that by an environment model we mean anything an agent can use to predict how its 
environment will respond to its actions in terms of state transitions and rewards, and by planning we 
mean any process that computes a policy from such a model. Environment models consist of two parts: 
the state-transition part encodes knowledge about the effect of actions on state changes, and the reward- 
model part encodes knowledge about the reward signals expected for each state or each state-action 
pair. A model-based algorithm selects actions by using a model to predict the consequences of possible 
courses of action in terms of future states and the reward signals expected to arise from those states. 
The simplest kind of planning is to compare the predicted consequences of collections of “imagined” 
sequences of decisions. 

Questions about whether or not animals use environment models, and if so, what the models are like 
and how they are learned, have played influential roles in the history of animal learning research. Some 
researchers challenged the then-prevailing stimulus-response (S-R) view of learning and behavior, which 
corresponds to the simplest model-free way of learning policies, by demonstrating latent learning. In the 
earliest latent learning experiment, two groups of rats were run in a maze. For the experimental group, 
there was no reward during the first stage of the experiment, but food was suddenly introduced into 
the goal box of the maze at the start of the second stage. For the control group, food was in the goal 
box throughout both stages. The question was whether or not rats in the experimental group would 
have learned anything during the first stage in the absence of food reward. Although the experimental 
rats did not appear to learn much during the first, unrewarded, stage, as soon as they discovered the 
food that was introduced in the second stage, they rapidly caught up with the rats in the control 
group. It was concluded that “during the non-reward period, the rats [in the experimental group] 
were developing a latent learning of the maze which they were able to utilize as soon as reward was 
introduced” (Blodgett, 1929). 

Latent learning is most closely associated with the psychologist Edward Tolman, who interpreted this 
result, and others like it, as showing that animals could learn a “cognitive map of the environment” in 
the absence of rewards or penalties, and that they could use the map later when they were motivated 
to reach a goal (Tolman, 1948). A cognitive map could also allow a rat to plan a route to the goal that 
was different from the route the rat had used in its initial exploration. Explanations of results like these 
led to the enduring controversy lying at the heart of the behaviorist/cognitive dichotomy in psychology. 
In modern terms, cognitive maps are not restricted to models of spatial layouts but are more generally 
environment models, or models of an animal’s “task space” (e.g., Wilson, Takahashi, Schoenbaum, and 
Niv, 2014). The cognitive map explanation of latent learning experiments is analogous to the claim 
that animals use model-based algorithms, and that environment models can be learned even without 
explicit rewards or penalties. Models are then used for planning when the animal is motivated by the 
appearance of rewards or penalties. 

Tolman’s account of how animals learn cognitive maps was that they learn stimulus-stimulus, or S-S, 
associations by experiencing successions of stimuli as they explore an environment. In psychology this is 
called expectancy theory: given S-S associations, the occurrence of a stimulus generates an expectation 
about the stimulus to come next. This is much like what control engineers call system identification, 
in which a model of a system with unknown dynamics is learned from labeled training examples. In 
the simplest discrete-time versions, training examples are S-S' pairs, where S is a state and S', the 
subsequent state, is the label. When S is observed, the model creates the “expectation” that S' will 
be observed next. Models more useful for planning involve actions as well, so that examples look like 
SA-S', where S' is expected when action A is executed in state S. It is also useful to learn how the 
environment generates rewards. In this case, examples are of the form S-R or SA-R, where R is a 
reward signal associated with S or the SA pair. These are all forms of supervised learning by which 
an agent can acquire cognitive-like maps whether or not it receives any non-zero reward signals while 
exploring its environment. 
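As a minimal sketch of this kind of supervised model learning, the following count-based tabular model is trained on SA-S' and SA-R examples gathered during unrewarded exploration. The class name, method names, and state labels are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

class TabularModel:
    """Count-based environment model learned from (S, A, R, S') examples."""

    def __init__(self):
        self.next_counts = defaultdict(Counter)  # (s, a) -> counts of observed s'
        self.reward_sum = defaultdict(float)     # (s, a) -> summed reward
        self.visits = Counter()                  # (s, a) -> number of examples

    def update(self, s, a, r, s_next):
        self.next_counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1

    def expected_next(self, s, a):
        """Most frequently observed successor: the 'expectation' of S'."""
        return self.next_counts[(s, a)].most_common(1)[0][0]

    def expected_reward(self, s, a):
        """Average reward observed for the state-action pair."""
        return self.reward_sum[(s, a)] / self.visits[(s, a)]

model = TabularModel()
# Unrewarded exploration: every reward is zero, yet transitions are still learned.
experience = [
    ("start", "left", 0.0, "corridor"),
    ("corridor", "right", 0.0, "goal_box"),
    ("start", "left", 0.0, "corridor"),
]
for s, a, r, s_next in experience:
    model.update(s, a, r, s_next)

print(model.expected_next("start", "left"))
print(model.expected_reward("corridor", "right"))
```

Nothing in the update depends on the reward being non-zero, which is the sense in which such a model can be acquired during a latent-learning phase.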


14.6 Habitual and Goal-directed Behavior 

The distinction between model-free and model-based reinforcement learning algorithms corresponds 
to the distinction psychologists make between habitual and goal-directed control of learned behavioral 
patterns. Habits are behavior patterns triggered by appropriate stimuli and then performed more-or- 
less automatically. Goal-directed behavior, according to how psychologists use the phrase, is purposeful 
in the sense that it is controlled by knowledge of the value of goals and the relationship between actions 
and their consequences. Habits are sometimes said to be controlled by antecedent stimuli, whereas goal- 
directed behavior is said to be controlled by its consequences (Dickinson, 1980, 1985). Goal-directed 
control has the advantage that it can rapidly change an animal’s behavior when the environment changes 
its way of reacting to the animal’s actions. While habitual behavior responds quickly to input from an 
accustomed environment, it is unable to quickly adjust to changes in the environment. The development 
of goal-directed behavioral control was likely a major advance in the evolution of animal intelligence. 

Figure 14.9 illustrates the difference between model-free and model-based decision strategies in a 
hypothetical task in which a rat has to navigate a maze that has distinctive goal boxes, each delivering 
an associated reward of the magnitude shown (Figure 14.9 top). Starting at S1, the rat has to first 
select left (L) or right (R) and then has to select L or R again at S2 or S3 to reach one of the goal boxes. 
The goal boxes are the terminal states of each episode of the rat’s episodic task. A model-free strategy 
(Figure 14.9 lower left) relies on stored values for state-action pairs. These action values (Q-values) 
are estimates of the highest return the rat can expect for each action taken from each (nonterminal) 
state. They are obtained over many trials of running the maze from start to finish. When the action 
values have become good enough estimates of the optimal returns, the rat just has to select at each 
state the action with the largest action value in order to make optimal decisions. In this case, when 
the action-value estimates become accurate enough, the rat selects L from S1 and R from S2 to obtain 
the maximum return of 4. A different model-free strategy might simply rely on a cached policy instead 
of action values, making direct links from S1 to L and from S2 to R. In neither of these strategies do 
decisions rely on an environment model. There is no need to consult a state-transition model, and no 
connection is required between the features of the goal boxes and the rewards they deliver. 
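The model-free strategy can be sketched with tabular Q-learning on a maze of this shape. This is an illustrative sketch, not a description of how the figure was produced: the rewards of 4 and 3 follow the text, whereas the rewards of 0 and 2 for the other two goal boxes, the step size, the number of trials, and the purely random exploration are assumptions.

```python
import random

random.seed(0)

# S1 leads to S2 (L) or S3 (R); each second choice ends the episode in a goal box.
TRANSITIONS = {("S1", "L"): "S2", ("S1", "R"): "S3"}
# Goal-box rewards: 4 and 3 follow the text; 0 and 2 are assumed for illustration.
GOAL_REWARD = {("S2", "L"): 0.0, ("S2", "R"): 4.0,
               ("S3", "L"): 2.0, ("S3", "R"): 3.0}
ACTIONS = ["L", "R"]

Q = {(s, a): 0.0 for s in ["S1", "S2", "S3"] for a in ACTIONS}
alpha = 0.5  # step size; no discounting in this short episodic task

for _ in range(500):
    # First choice, explored at random; the reward at this step is zero.
    a1 = random.choice(ACTIONS)
    s2 = TRANSITIONS[("S1", a1)]
    target = max(Q[(s2, a)] for a in ACTIONS)
    Q[("S1", a1)] += alpha * (target - Q[("S1", a1)])
    # Second choice leads to a goal box, which is a terminal state.
    a2 = random.choice(ACTIONS)
    Q[(s2, a2)] += alpha * (GOAL_REWARD[(s2, a2)] - Q[(s2, a2)])

def greedy(state):
    """Select the action with the largest stored action value."""
    return max(ACTIONS, key=lambda a: Q[(state, a)])

print(greedy("S1"), greedy("S2"))
```

After training, decisions require only a lookup of the stored values; neither TRANSITIONS nor GOAL_REWARD is consulted by greedy.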

Figure 14.9 (lower right) illustrates a model-based strategy. It uses an environment model consisting 
of a state-transition model and a reward model. The state-transition model is shown as a decision tree, 
and the reward model associates the distinctive features of the goal boxes with the rewards to be found 
in each. (The rewards associated with states S1, S2, and S3 are also part of the reward model, but here 
they are zero and are not shown.) A model-based agent can decide which way to turn at each state 
by using the model to simulate sequences of action choices to find a path yielding the highest return. 
In this case the return is the reward obtained from the outcome at the end of the path. Here, with a 
sufficiently accurate model, the rat would select L and then R to obtain a reward of 4. Comparing the 
predicted returns of simulated paths is a simple form of planning, which can be done in a variety of 
ways as discussed in Chapter 8. 

Figure 14.9: Model-based and model-free strategies to solve a hypothetical sequential action-selection problem. 
Top: a rat navigates a maze with distinctive goal boxes, each associated with a reward having the value shown. 
Lower left: a model-free strategy relies on stored action values for all the state-action pairs obtained over many 
learning trials. To make decisions the rat just has to select at each state the action with the largest action 
value for that state. Lower right: in a model-based strategy, the rat learns an environment model, consisting 
of knowledge of state-action-next-state transitions and a reward model consisting of knowledge of the reward 
associated with each distinctive goal box. The rat can decide which way to turn at each state by using the 
model to simulate sequences of action choices to find a path yielding the highest return. Adapted from Trends in 
Cognitive Science, volume 10, number 8, Y. Niv, D. Joel, and P. Dayan, A Normative Perspective on Motivation, 
p. 376, 2006, with permission from Elsevier. 

When the environment of a model-free agent changes the way it reacts to the agent’s actions, the 
agent has to acquire new experience in the changed environment during which it can update its policy 
and/or value function. In the model-free strategy shown in Figure 14.9 (lower left), for example, if 
one of the goal boxes were to somehow shift to delivering a different reward, the rat would have to 
traverse the maze, possibly many times, to experience the new reward upon reaching that goal box, all 
the while updating either its policy or its action-value function (or both) based on this experience. The 
key point is that for a model-free agent to change the action its policy specifies for a state, or to change 
an action value associated with a state, it has to move to that state, act from it, possibly many times, 
and experience the consequences of its actions. 

A model-based agent can accommodate changes in its environment without this kind of ‘personal 
experience’ with the states and actions affected by the change. A change in its model automatically 
(through planning) changes its policy. Planning can determine the consequences of changes in the 
environment that have never been linked together in the agent’s own experience. For example, again 
referring to the maze task of Figure 14.9, imagine that a rat with a previously learned transition and 
reward model is placed directly in the goal box to the right of S2 to find that the reward available 
there now has value 1 instead of 4. The rat’s reward model will change even though the action choices 
required to find that goal box in the maze were not involved. The planning process will bring knowledge 
of the new reward to bear on maze running without the need for additional experience in the maze; in 
this case changing the policy to right turns at both S1 and S3 to obtain a return of 3. 
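The contrast can be sketched in a few lines. In this illustration the rewards of 4, 3, and the devalued 1 follow the text, whereas the rewards of 0 and 2 for the remaining goal boxes and all of the names are assumptions. Planning is done by exhaustively simulating the two-step paths in the model.

```python
TRANSITIONS = {("S1", "L"): "S2", ("S1", "R"): "S3"}
# Reward model: 4 and 3 follow the text; 0 and 2 are assumed for illustration.
reward_model = {("S2", "L"): 0.0, ("S2", "R"): 4.0,
                ("S3", "L"): 2.0, ("S3", "R"): 3.0}
ACTIONS = ["L", "R"]

def plan(transitions, rewards):
    """Simulate every two-step path in the model and return the best one."""
    best_path, best_return = None, float("-inf")
    for a1 in ACTIONS:
        s = transitions[("S1", a1)]
        for a2 in ACTIONS:
            ret = rewards[(s, a2)]
            if ret > best_return:
                best_path, best_return = (a1, a2), ret
    return best_path, best_return

# Cached action values at S1 that a model-free learner would hold after training.
cached_Q = {("S1", "L"): 4.0, ("S1", "R"): 3.0}

before = plan(TRANSITIONS, reward_model)

# The rat is placed directly in the goal box to the right of S2 and finds reward 1.
reward_model[("S2", "R")] = 1.0

after = plan(TRANSITIONS, reward_model)
stale_choice = max(ACTIONS, key=lambda a: cached_Q[("S1", a)])

print("plan before devaluation:", before)
print("plan after devaluation:", after)
print("model-free choice at S1 (unchanged):", stale_choice)
```

The planner switches to right turns as soon as the reward model changes, whereas the cached values at S1 still favor L until the agent runs the maze again and updates them.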

Exactly this logic is the basis of outcome-devaluation experiments with animals. Results from these 
experiments provide insight into whether an animal has learned a habit or if its behavior is under 
goal-directed control. Outcome-devaluation experiments are like latent-learning experiments in that 
the reward changes from one stage to the next. After an initial rewarded stage of learning, the reward 
value of an outcome is changed, including being shifted to zero or even to a negative value. 

An early important experiment of this type was conducted by Adams and Dickinson (1981). They 
trained rats via instrumental conditioning until the rats energetically pressed a lever for sucrose pellets 
in a training chamber. The rats were then placed in the same chamber with the lever retracted and 
allowed non-contingent food, meaning that pellets were made available to them independently of their 
actions. After 15 minutes of this free access to the pellets, rats in one group were injected with the 
nausea-inducing poison lithium chloride. This was repeated for three sessions, in the last of which none 
of the injected rats consumed any of the non-contingent pellets, indicating that the reward value of the 
pellets had been decreased—the pellets had been devalued. In the next stage taking place a day later, 
the rats were again placed in the chamber and given a session of extinction training, meaning that the 
response lever was back in place but disconnected from the pellet dispenser so that pressing it did not 
release pellets. The question was whether the rats that had the reward value of the pellets decreased 
would lever-press less than rats that did not have the reward value of the pellets decreased, even without 
experiencing the devalued reward as a result of lever-pressing. It turned out that the injected rats had 
significantly lower response rates than the non-injected rats right from the start of the extinction trials. 

Adams and Dickinson concluded that the injected rats associated lever pressing with consequent 
nausea by means of a cognitive map linking lever pressing to pellets, and pellets to nausea. Hence, in 
the extinction trials, the rats “knew” that the consequences of pressing the lever would be something 
they did not want, and so they reduced their lever-pressing right from the start. The important point 
is that they reduced lever-pressing without ever having experienced lever-pressing directly followed by 
being sick: no lever was present when they were made sick. They seemed able to combine knowledge 
of the outcome of a behavioral choice (pressing the lever will be followed by getting a pellet) with the 
reward value of the outcome (pellets are to be avoided) and hence could alter their behavior accordingly. 
Not every psychologist agrees with this “cognitive” account of this kind of experiment, and it is not the 
only possible way to explain these results, but the model-based planning explanation is widely accepted. 

Nothing prevents an agent from using both model-free and model-based algorithms, and there are 
good reasons for using both. We know from our own experience that with enough repetition, goal- 
directed behavior tends to turn into habitual behavior. Experiments show that this happens for rats 
too. Adams (1982) conducted an experiment to see if extended training would convert goal-directed 
behavior into habitual behavior. He did this by comparing the effect of outcome devaluation on rats 
that experienced different amounts of training. If extended training made the rats less sensitive to 
devaluation compared to rats that received less training, this would be evidence that extended training 
made the behavior more habitual. Adams’ experiment closely followed the Adams and Dickinson 
(1981) experiment just described. Simplifying a bit, rats in one group were trained until they made 
100 rewarded lever-presses, and rats in the other group—the overtrained group—were trained until 
they made 500 rewarded lever-presses. After this training, the reward value of the pellets was decreased 
(using lithium chloride injections) for rats in both groups. Then both groups of rats were given a session 
of extinction training. Adams’ question was whether devaluation would affect the rate of lever-pressing 
for the overtrained rats less than it would for the non-overtrained rats, which would be evidence that 
extended training reduces sensitivity to outcome devaluation. It turned out that devaluation strongly 
decreased the lever-pressing rate of the non-overtrained rats. For the overtrained rats, in contrast, 
devaluation had little effect on their lever-pressing; in fact, if anything, it made it more vigorous. 
(The full experiment included control groups showing that the different amounts of training did not by 
themselves significantly affect lever-pressing rates after learning.) This result suggested that while the 
non-overtrained rats were acting in a goal-directed manner sensitive to their knowledge of the outcome 
of their actions, the overtrained rats had developed a lever-pressing habit. 

Viewing this and other results like it from a computational perspective provides insight as to why 
one might expect animals to behave habitually in some circumstances, in a goal-directed way in others, 
and why they shift from one mode of control to another as they continue to learn. While animals 
undoubtedly use algorithms that do not exactly match those we have presented in this book, one 
can gain insight into animal behavior by considering the tradeoffs that various reinforcement learning 
algorithms imply. An idea developed by computational neuroscientists Daw, Niv, and Dayan (2005) 
is that animals use both model-free and model-based processes. Each process proposes an action, and 
the action chosen for execution is the one proposed by the process judged to be the more trustworthy 
of the two as determined by measures of confidence that are maintained throughout learning. Early in 
learning the planning process of a model-based system is more trustworthy because it chains together 
short-term predictions which can become accurate with less experience than long-term predictions of the 
model-free process. But with continued experience, the model-free process becomes more trustworthy 
because planning is prone to making mistakes due to model inaccuracies and short-cuts necessary to 
make planning feasible, such as various forms of “tree-pruning”: the removal of unpromising search 
tree branches. According to this idea one would expect a shift from goal-directed behavior to habitual 
behavior as more experience accumulates. Other ideas have been proposed for how animals arbitrate 
between goal-directed and habitual control, and both behavioral and neuroscience research continues 
to examine this and related questions. 
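The arbitration idea can be illustrated with a deliberately simplified sketch. It is not the model of Daw, Niv, and Dayan (2005): the uncertainty curves below are assumed forms chosen only so that planning is more trustworthy early in learning and cached values are more trustworthy later, and all names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Proposal:
    action: str
    uncertainty: float  # lower means more trustworthy

def arbitrate(model_based, model_free):
    """Return the proposal of the controller that is currently less uncertain."""
    if model_based.uncertainty <= model_free.uncertainty:
        return model_based
    return model_free

def model_free_uncertainty(n_trials):
    # Assumed form: cached long-term predictions sharpen steadily with experience.
    return 1.0 / (1.0 + n_trials)

def model_based_uncertainty(n_trials):
    # Assumed form: planning improves quickly but keeps a floor of error
    # from model inaccuracies and tree pruning.
    return max(0.05, 0.5 / (1.0 + n_trials))

def controller_in_charge(n_trials):
    """Name the action that would be executed after n_trials of experience."""
    mb = Proposal("planned action", model_based_uncertainty(n_trials))
    mf = Proposal("habitual action", model_free_uncertainty(n_trials))
    return arbitrate(mb, mf).action

for n in (0, 5, 20, 100):
    print(n, controller_in_charge(n))
```

Under these assumed curves, control passes from the planning process to the habitual process as experience accumulates, which is the qualitative shift described above.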

The distinction between model-free and model-based algorithms is proving to be useful for this 
research. One can examine the computational implications of these types of algorithms in abstract settings 
that expose basic advantages and limitations of each type. This serves both to suggest and to sharpen 
questions that guide the design of experiments necessary for increasing psychologists’ understanding of 
habitual and goal-directed behavioral control. 


14.7 Summary 

Our goal in this chapter has been to discuss correspondences between reinforcement learning and the 
experimental study of animal learning in psychology. We emphasized at the outset that reinforcement 
learning as described in this book is not intended to model details of animal behavior. It is an abstract 
computational framework that explores idealized situations from the perspective of artificial intelligence 
and engineering. But many of the basic reinforcement learning algorithms were inspired by psychological 
theories, and in some cases, these algorithms have contributed to the development of new animal learning 
models. This chapter described the most conspicuous of these correspondences. 

The distinction in reinforcement learning between algorithms for prediction and algorithms for control 
parallels animal learning theory’s distinction between classical, or Pavlovian, conditioning and 
instrumental conditioning. The key difference between instrumental and classical conditioning experiments 
is that in the former the reinforcing stimulus is contingent upon the animal’s behavior, whereas in 
the latter it is not. Learning to predict via a TD algorithm corresponds to classical conditioning, and 
we described the TD model of classical conditioning as one instance in which reinforcement learning 
principles account for some details of animal learning behavior. This model generalizes the influential 
Rescorla-Wagner model by including the temporal dimension where events within individual trials 
influence learning, and it provides an account of second-order conditioning, where predictors of reinforcing 
stimuli become reinforcing themselves. It also is the basis of an influential view of the activity 
of dopamine neurons in the brain, something we take up in Chapter 15. 

Learning by trial and error is at the base of the control aspect of reinforcement learning. We presented 
some details about Thorndike’s experiments with cats and other animals that led to his Law of Effect, 
which we discussed here and in Chapter 1. We pointed out that in reinforcement learning, exploration 
does not have to be limited to “blind groping”; trials can be generated by sophisticated methods using 
innate and previously learned knowledge as long as there is some exploration. We discussed the training 
method B. F. Skinner called shaping in which reward contingencies are progressively altered to train 
an animal to successively approximate a desired behavior. Shaping is not only indispensable for animal 
training, it is also an effective tool for training reinforcement learning agents. There is also a connection 
to the idea of an animal’s motivational state, which influences what an animal will approach or avoid 
and what events are rewarding or punishing for the animal. 

The reinforcement learning algorithms presented in this book include two basic mechanisms for 
addressing the problem of delayed reinforcement: eligibility traces and value functions learned via TD 
algorithms. Both mechanisms have antecedents in theories of animal learning. Eligibility traces are 
similar to stimulus traces of early theories, and value functions correspond to the role of secondary 
reinforcement in providing nearly immediate evaluative feedback. 

The next correspondence the chapter addressed is that between reinforcement learning’s environment 
models and what psychologists call cognitive maps. Experiments conducted in the mid 20th century 
purported to demonstrate the ability of animals to learn cognitive maps as alternatives to, or as additions 
to, state-action associations, and later use them to guide behavior, especially when the environment 
changes unexpectedly. Environment models in reinforcement learning are like cognitive maps in that 
they can be learned by supervised learning methods without relying on reward signals, and then they 
can be used later to plan behavior. 

Reinforcement learning’s distinction between model-free and model-based algorithms corresponds to 
the distinction in psychology between habitual and goal-directed behavior. Model-free algorithms make 
decisions by accessing information that has been stored in a policy or an action-value function, whereas 
model-based methods select actions as the result of planning ahead using a model of the agent’s 
environment. Outcome-devaluation experiments provide information about whether an animal’s behavior 
is habitual or under goal-directed control. Reinforcement learning theory has helped clarify thinking 
about these issues. 

Animal learning clearly informs reinforcement learning, but as a type of machine learning, 
reinforcement learning is directed toward designing and understanding effective learning algorithms, not toward 
replicating or explaining details of animal behavior. We focused on aspects of animal learning that 
relate in clear ways to methods for solving prediction and control problems, highlighting the fruitful 
two-way flow of ideas between reinforcement learning and psychology without venturing deeply into 
many of the behavioral details and controversies that have occupied the attention of animal learning 
researchers. Future development of reinforcement learning theory and algorithms will likely exploit 
links to many other features of animal learning as the computational utility of these features becomes 
better appreciated. We expect that a flow of ideas between reinforcement learning and psychology will 
continue to bear fruit for both disciplines. 

Many connections between reinforcement learning and areas of psychology and other behavioral 
sciences are beyond the scope of this chapter. We largely omit discussing links to the psychology of 
decision making, which focuses on how actions are selected, or how decisions are made, after learning 
has taken place. We also do not discuss links to ecological and evolutionary aspects of behavior studied 
by ethologists and behavioral ecologists: how animals relate to one another and to their physical 
surroundings, and how their behavior contributes to evolutionary fitness. Optimization, MDPs, and 
dynamic programming figure prominently in these fields, and our emphasis on agent interaction with 
dynamic environments connects to the study of agent behavior in complex “ecologies.” Multi-agent 
reinforcement learning, omitted in this book, has connections to social aspects of behavior. Despite 
the lack of treatment here, reinforcement learning should by no means be interpreted as dismissing 
evolutionary perspectives. Nothing about reinforcement learning implies a tabula rasa view of learning 
and behavior. Indeed, experience with engineering applications has highlighted the importance of 
building into reinforcement learning systems knowledge that is analogous to what evolution provides to 
animals. 


Bibliographical and Historical Remarks 

Ludvig, Bellemare, and Pearson (2011) and Shah (2012) review reinforcement learning in the contexts 
of psychology and neuroscience. These publications are useful companions to this chapter and the 
following chapter on reinforcement learning and neuroscience. 

14.1 Dayan, Niv, Seymour, and Daw (2006) focused on interactions between classical and 
instrumental conditioning, particularly situations where classically-conditioned and instrumental 
responses are in conflict. They proposed a Q-learning framework for modeling aspects of this 
interaction. Modayil and Sutton (2014) used a mobile robot to demonstrate the effectiveness of 
a control method combining a fixed response with online prediction learning. Calling this 
Pavlovian control, they emphasized that it differs from the usual control methods of reinforcement 
learning, being based on predictively executing fixed responses and not on reward 
maximization. The electro-mechanical machine of Ross (1933) and especially the learning version of 
Walter’s turtle (Walter, 1951) were very early illustrations of Pavlovian control. What is now 
called Pavlovian-instrumental transfer was first observed by Estes (1943, 1948). 

14.2.1 Kamin (1968) first reported blocking, now commonly known as Kamin blocking, in classical 
conditioning. Moore and Schmajuk (2008) provide an excellent summary of the blocking 
phenomenon, the research it stimulated, and its lasting influence on animal learning theory. Gibbs, 
Cool, Land, Kehoe, and Gormezano (1991) describe second-order conditioning of the rabbit’s 
nictitating membrane response and its relationship to conditioning with serial-compound 
stimuli. Finch and Culler (1934) reported obtaining fifth-order conditioning of a dog’s foreleg 
withdrawal “when the motivation of the animal is maintained through the various orders.” 

14.2.2 The idea built into the Rescorla-Wagner model that learning occurs when animals are 
surprised is derived from Kamin (1969). Models of classical conditioning other than Rescorla and 
Wagner’s include the models of Klopf (1988), Grossberg (1975), Mackintosh (1975), Moore and 
Stickney (1980), Pearce and Hall (1980), and Courville, Daw, and Touretzky (2006). Schmajuk 
(2008) reviews models of classical conditioning. 

14.2.3 An early version of the TD model of classical conditioning appeared in Sutton and Barto (1981), 
which also included the early model’s prediction that temporal primacy overrides blocking, later 
shown by Kehoe, Schreurs, and Graham (1987) to occur in the rabbit nictitating membrane 
preparation. Sutton and Barto (1981) contains the earliest recognition of the near identity 
between the Rescorla-Wagner model and the Least-Mean-Square (LMS), or Widrow-Hoff, learning 
rule (Widrow and Hoff, 1960). This early model was revised following Sutton’s development 
of the TD algorithm (Sutton, 1984, 1988) and was first presented as the TD model in Sutton 
and Barto (1987) and more completely in Sutton and Barto (1990), upon which this section is 
largely based. Additional exploration of the TD model and its possible neural implementation 
was conducted by Moore and colleagues (Moore, Desmond, Berthier, Blazis, Sutton, and Barto, 
1986; Moore and Blazis, 1989; Moore, Choi, and Brunzell, 1998; Moore, Marks, Castagna, and 
Polewan, 2001). Klopf’s (1988) drive-reinforcement theory of classical conditioning extends 
the TD model to address additional experimental details, such as the S-shape of acquisition 
curves. In some of these publications TD is taken to mean Time Derivative instead of Temporal 
Difference. 

14.2.4 Ludvig, Sutton, and Kehoe (2012) evaluated the performance of the TD model in previously 
unexplored tasks involving classical conditioning and examined the influence of various 
stimulus representations, including the microstimulus representation that they introduced earlier 
(Ludvig, Sutton, and Kehoe, 2008). Earlier investigations of the influence of various stimulus 
representations and their possible neural implementations on response timing and topography 
in the context of the TD model are those of Moore and colleagues cited above. Although not in 
the context of the TD model, representations like the microstimulus representation of Ludvig et 
al. (2012) have been proposed and studied by Grossberg and Schmajuk (1989), Brown, Bullock, 
and Grossberg (1999), Buhusi and Schmajuk (1999), and Machado (1997). 

14.3 Section 1.7 includes comments on the history of trial-and-error learning and the Law of Effect. 
The idea that Thorndike’s cats might have been exploring according to an instinctual 
context-specific ordering over actions rather than by just selecting from a set of instinctual impulses 
was suggested by Peter Dayan (personal communication). Selfridge, Sutton, and Barto (1985) 
illustrated the effectiveness of shaping in a pole-balancing reinforcement learning task. Other 
examples of shaping in reinforcement learning are Gullapalli and Barto (1992), Mahadevan 
and Connell (1992), Mataric (1994), Dorigo and Colombetti (1994), Saksida, Raymond, and 
Touretzky (1997), and Randløv and Alstrøm (1998). Ng (2003) and Ng, Harada, and Russell 
(1999) used the term shaping in a sense somewhat different from Skinner’s, focusing on the 
problem of how to alter the reward signal without altering the set of optimal policies. 

Dickinson and Balleine (2002) discuss the complexity of the interaction between learning and 
motivation. Wise (2004) provides an overview of reinforcement learning and its relation to 
motivation. Daw and Shohamy (2008) link motivation and learning to aspects of reinforcement 
learning theory. See also McClure, Daw, and Montague (2003), Niv, Joel, and Dayan (2006), 
Rangel et al. (2008), and Dayan and Berridge (2014). McClure et al. (2003), Niv, Daw, and 
Dayan (2005), and Niv, Daw, Joel, and Dayan (2007) present theories of behavioral vigor 
related to the reinforcement learning framework. 

14.4 Spence, Hull’s student and collaborator at Yale, elaborated the role of higher-order reinforcement 
in addressing the problem of delayed reinforcement (Spence, 1947). Learning over very 
long delays, as in taste-aversion conditioning with delays up to several hours, led to interference 
theories as alternatives to decaying-trace theories (e.g., Revusky and Garcia, 1970; Boakes and 
Costa, 2014). Other views of learning under delayed reinforcement invoke roles for awareness 
and working memory (e.g., Clark and Squire, 1998; Seo, Barraclough, and Lee, 2007). 

14.5 Thistlethwaite (1951) is an extensive review of latent learning experiments up to the time of its 
publication. Ljung (1998) is an overview of model learning, or system identification, techniques 
in engineering. Gopnik, Glymour, Sobel, Schulz, Kushnir, and Danks (2004) present a Bayesian 
theory about how children learn models. 

Connections between habitual and goal-directed behavior and model-free and model-based 
reinforcement learning were first proposed by Daw, Niv, and Dayan (2005). The hypothetical 
maze task used to explain habitual and goal-directed behavioral control is based on the 
explanation of Niv, Joel, and Dayan (2006). Dolan and Dayan (2013) review four generations of 
experimental research related to this issue and discuss how it can move forward on the basis of 
reinforcement learning’s model-free/model-based distinction. Dickinson (1980, 1985) and 
Dickinson and Balleine (2002) discuss experimental evidence related to this distinction. Donahoe 
and Burgos (2000) alternatively argue that model-free processes can account for the results of 
outcome-devaluation experiments. Dayan and Berridge (2014) argue that classical conditioning 
involves model-based processes. Rangel, Camerer, and Montague (2008) review many of the 
outstanding issues involving habitual, goal-directed, and Pavlovian modes of control. 

Comments on Terminology - The traditional meaning of reinforcement in psychology is the strengthening 
of a pattern of behavior (by increasing either its intensity or frequency) as a result of an animal 
receiving a stimulus (or experiencing the omission of a stimulus) in an appropriate temporal relationship 
with another stimulus or with a response. Reinforcement produces changes that remain in future 
behavior. Sometimes in psychology reinforcement refers to the process of producing lasting changes in 
behavior, whether the changes strengthen or weaken a behavior pattern (Mackintosh, 1983). Letting 
reinforcement refer to weakening in addition to strengthening is at odds with the everyday meaning of 
reinforce, and its traditional use in psychology, but it is a useful extension that we have adopted here. 
In either case, a stimulus considered to be the cause of the behavioral change is called a reinforcer. 

Psychologists do not generally use the specific phrase reinforcement learning as we do. Animal 
learning pioneers probably regarded reinforcement and learning as being synonymous, so it would be 
redundant to use both words. Our use of the phrase follows its use in computational and engineering 
research, influenced mostly by Minsky (1961). But the phrase is lately gaining currency in psychology and 
neuroscience, likely because strong parallels have surfaced between reinforcement learning algorithms 
and animal learning, parallels described in this chapter and the next. 

According to common usage, a reward is an object or event that an animal will approach and work 
for. A reward may be given to an animal in recognition of its ‘good’ behavior, or given in order to make 
the animal’s behavior ‘better.’ Similarly, a penalty is an object or event that the animal usually avoids 
and that is given as a consequence of ‘bad’ behavior, usually in order to change that behavior. Primary 
reward is reward due to machinery built into an animal’s nervous system by evolution to improve its 
chances of survival and reproduction, e.g., reward produced by the taste of nourishing food, sexual 
contact, successful escape, and many other stimuli and events that predicted reproductive success over 
the animal’s ancestral history. As explained in Section 14.2.1, higher-order reward is reward delivered 
by stimuli that predict primary reward, either directly or indirectly by predicting other stimuli that 
predict primary reward. Reward is secondary if its rewarding quality is the result of directly predicting 
primary reward. 

In this book we call $R_t$ the ‘reward signal at time t’ or sometimes just the ‘reward at time t,’ but we 
do not think of it as an object or event in the agent’s environment. Because $R_t$ is a number, not an 
object or an event, it is more like a reward signal in neuroscience, which is a signal internal to the 
brain, like the activity of neurons, that influences decision making and learning. This signal might be 
triggered when the animal perceives an attractive (or an aversive) object, but it can also be triggered 
by things that do not physically exist in the animal’s external environment, such as memories, ideas, or 
hallucinations. Because our $R_t$ can be positive, negative, or zero, it might be better to call a negative 
$R_t$ a penalty, and an $R_t$ equal to zero a neutral signal, but for simplicity we generally avoid these terms. 

In reinforcement learning, the process that generates all the $R_t$s defines the problem the agent is 
trying to solve. The agent’s objective is to keep the magnitude of $R_t$ as large as possible over time. In 
this respect, $R_t$ is like primary reward for an animal if we think of the problem the animal faces as the 
problem of obtaining as much primary reward as possible over its lifetime (and thereby, through the 
prospective “wisdom” of evolution, improve its chances of solving its real problem, which is to pass its 
genes on to future generations). However, as we suggest in Chapter 15, it is unlikely that there is a 
single “master” reward signal like $R_t$ in an animal’s brain. 

Not all reinforcers are rewards or penalties. Sometimes reinforcement is not the result of an animal 
receiving a stimulus that evaluates its behavior by labeling the behavior good or bad. A behavior 
pattern can be reinforced by a stimulus that reaches an animal no matter how the animal behaved. As 
described in Section 14.1, whether the delivery of a reinforcer depends, or does not depend, on preceding 
behavior is the defining difference between instrumental, or operant, conditioning experiments and 
classical, or Pavlovian, conditioning experiments. Reinforcement is at work in both types of experiments, 
but only in the former is it feedback that evaluates past behavior. (Though it has often been pointed 
out that even when the reinforcing US in a classical conditioning experiment is not contingent on the 
subject’s preceding behavior, its reinforcing value can be influenced by this behavior, an example being 
that a closed eye makes an air puff to the eye less aversive.) 

The distinction between reward signals and reinforcement signals is a crucial point when we discuss 
neural correlates of these signals in the next chapter. Like a reward signal, for us, the reinforcement 
signal at any specific time is a positive or negative number, or zero. A reinforcement signal is the major 
factor directing changes a learning algorithm makes in an agent’s policy, value estimates, or environment 
models. The definition that makes the most sense to us is that a reinforcement signal at any time is a 
number that multiplies (possibly along with some constants) a vector to determine parameter updates 
in some learning algorithm. 
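
This definition can be written compactly. The following is only a sketch in notation assumed here for illustration and not taken from this passage: $\mathbf{w}_t$ is the parameter vector, $\alpha$ a step-size constant, $\rho_t$ the reinforcement signal, and $\mathbf{v}_t$ the vector it multiplies.

```latex
\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \, \rho_t \, \mathbf{v}_t
```

In a TD method with eligibility traces, for example, $\rho_t$ would be a TD error and $\mathbf{v}_t$ the eligibility-trace vector; other algorithms fill the two roles differently.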

For some algorithms, the reward signal alone is the critical multiplier in the parameter-update equation. 
For these algorithms the reinforcement signal is the same as the reward signal. But for most of 
the algorithms we discuss in this book, reinforcement signals include terms in addition to the reward 
signal, an example being a TD error $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$, which is the reinforcement signal 
for TD state-value learning (and analogous TD errors for action-value learning). In this reinforcement 
signal, $R_{t+1}$ is the primary reinforcement contribution, and the temporal difference in predicted values, 
$\gamma V(S_{t+1}) - V(S_t)$ (or an analogous temporal difference for action values), is the conditioned reinforcement 
contribution. Thus, whenever $\gamma V(S_{t+1}) - V(S_t) = 0$, $\delta_t$ signals ‘pure’ primary reinforcement; and 
whenever $R_{t+1} = 0$, it signals ‘pure’ conditioned reinforcement, but it often signals a mixture of these. 
Note that, as we mentioned in Section 6.1, this $\delta_t$ is not available until time $t+1$. We therefore think of $\delta_t$ as 
the reinforcement signal at time $t+1$, which is fitting because it reinforces predictions and/or actions 
made earlier at step $t$. 
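
The split of a TD error into its primary and conditioned contributions can be illustrated with a short sketch. The function name, argument names, and numerical values below are hypothetical and chosen only for illustration; the arithmetic is the state-value TD error described above.

```python
def td_error(reward_next, value_next, value_current, gamma=0.9):
    """Return (delta, primary, conditioned) for one transition.

    reward_next   : R_{t+1}, the reward signal received on the transition
    value_next    : V(S_{t+1}), the current estimate for the next state
    value_current : V(S_t), the current estimate for the state being left
    gamma         : discount rate (0.9 is an arbitrary illustrative default)
    """
    primary = reward_next                              # primary reinforcement
    conditioned = gamma * value_next - value_current   # conditioned reinforcement
    return primary + conditioned, primary, conditioned


# 'Pure' primary reinforcement: the temporal difference in predictions is zero.
delta_a, primary_a, conditioned_a = td_error(1.0, value_next=0.0, value_current=0.0)

# 'Pure' conditioned reinforcement: no reward, but the prediction rises.
delta_b, primary_b, conditioned_b = td_error(0.0, value_next=1.0, value_current=0.5, gamma=1.0)
```

In the first call the whole signal comes from the reward; in the second it comes entirely from the change in predicted value. Most transitions in practice would give a mixture of the two.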

A possible source of confusion is the terminology used by the famous psychologist B.F. Skinner and 
his followers. For Skinner, positive reinforcement occurs when the consequences of an animal’s behavior 
increase the frequency of that behavior; punishment occurs when the behavior’s consequences decrease 
that behavior’s frequency. Negative reinforcement occurs when behavior leads to the removal of an 
aversive stimulus (that is, a stimulus the animal does not like), thereby increasing the frequency of 
that behavior. Negative punishment, on the other hand, occurs when behavior leads to the removal of 
an appetitive stimulus (that is, a stimulus the animal likes), thereby decreasing the frequency of that 
behavior. We find no critical need for these distinctions because our approach is more abstract than 
this, with both reward and reinforcement signals allowed to take on both positive and negative values. 
(But note especially that when our reinforcement signal is negative, it is not the same as Skinner’s 
negative reinforcement.) 

On the other hand, it has often been pointed out that using a single number as a reward or a 
penalty signal, depending only on its sign, is at odds with the fact that animals’ appetitive and aversive 
systems have qualitatively different properties and involve different brain mechanisms. This points to 
a direction in which the reinforcement learning framework might be developed in the future to exploit 
computational advantages of separate appetitive and aversive systems, but for now we are passing over 
these possibilities. 

Another discrepancy in terminology is how we use the word action. To many cognitive scientists, an 
action is purposeful in the sense of being the result of an animal’s knowledge about the relationship 
between the behavior in question and the consequences of that behavior. An action is goal-directed and 
the result of a decision, in contrast to a response, which is triggered by a stimulus and is the result of a reflex or 
a habit. We use the word action without differentiating among what others call actions, decisions, and 
responses. These are important distinctions, but for us they are encompassed by differences between 
model-free and model-based reinforcement learning algorithms, which we discussed above in relation to 
habitual and goal-directed behavior in Section 14.6. Dickinson (1985) discusses the distinction between 
responses and actions. 

A term used a lot in this book is control. What we mean by control is entirely different from what it 
means to animal learning psychologists. By control we mean that an agent influences its environment to 
bring about states or events that the agent prefers: the agent exerts control over its environment. This 
is the sense of control used by control engineers. In psychology, on the other hand, control typically 
means that an animal’s behavior is influenced by—is controlled by—the stimuli the animal receives 
(stimulus control) or the reinforcement schedule it experiences. Here the environment is controlling the 
agent. Control in this sense is the basis of behavior modification therapy. Of course, both of these 
directions of control are at play when an agent interacts with its environment, but our focus is on the 
agent as controller, not the environment as controller. A view equivalent to ours, and perhaps more 
illuminating, is that the agent is actually controlling the input it receives from its environment (Powers, 
1973). This is not what psychologists mean by stimulus control. 

Sometimes reinforcement learning is understood to refer solely to learning policies directly from 
rewards (and penalties) without the involvement of value functions or environment models. This is 
what psychologists call stimulus-response, or S-R, learning. But for us, along with most of today’s 
psychologists, reinforcement learning is much broader than this, including, in addition to S-R learning, 
methods involving value functions, environment models, planning, and other processes that are 
commonly thought to belong to the more cognitive side of mental functioning. 



Chapter 15 


Neuroscience 


Neuroscience is the multidisciplinary study of nervous systems: how they regulate bodily functions; 
control behavior; change over time as a result of development, learning, and aging; and how cellular and 
molecular mechanisms make these functions possible. One of the most exciting aspects of reinforcement 
learning is the mounting evidence from neuroscience that the nervous systems of humans and many other 
animals implement algorithms that correspond in striking ways to reinforcement learning algorithms. 
The main objective of this chapter is to explain these parallels and what they suggest about the neural 
basis of reward-related learning in animals. 

The most remarkable point of contact between reinforcement learning and neuroscience involves 
dopamine, a chemical deeply involved in reward processing in the brains of mammals. Dopamine 
appears to convey temporal-difference (TD) errors to brain structures where learning and decision 
making take place. This parallel is expressed by the reward prediction error hypothesis of dopamine 
neuron activity, a hypothesis that resulted from the convergence of computational reinforcement learning 
and results of neuroscience experiments. In this chapter we discuss this hypothesis, the neuroscience 
findings that led to it, and why it is a significant contribution to understanding brain reward systems. 
We also discuss parallels between reinforcement learning and neuroscience that are less striking than 
this dopamine/TD-error parallel but that provide useful conceptual tools for thinking about reward-based 
learning in animals. Other elements of reinforcement learning have the potential to impact the 
study of nervous systems, but their connections to neuroscience are still relatively undeveloped. We 
discuss several of these evolving connections that we think will grow in importance over time. 

As we outlined in the history section of this book’s introductory chapter (Section 1.7), many aspects of 
reinforcement learning were influenced by neuroscience. A second objective of this chapter is to acquaint 
readers with ideas about brain function that have contributed to our approach to reinforcement learning. 
Some elements of reinforcement learning are easier to understand when seen in light of theories of brain 
function. This is particularly true for the idea of the eligibility trace, one of the basic mechanisms of 
reinforcement learning, that originated as a conjectured property of synapses, the structures by which 
nerve cells—neurons—communicate with one another. 

In this chapter we do not delve very deeply into the enormous complexity of the neural systems 
underlying reward-based learning in animals: this chapter is too short, and we are not neuroscientists. 
We do not try to describe—or even to name—the very many brain structures and pathways, or any 
of the molecular mechanisms, believed to be involved in these processes. We also do not do justice to 
hypotheses and models that are alternatives to those that align so well with reinforcement learning. It 
should not be surprising that there are differing views among experts in the field. We can only provide 
a glimpse into this fascinating and developing story. We hope, though, that this chapter convinces 
you that a very fruitful channel has emerged connecting reinforcement learning and its theoretical 
underpinnings to the neuroscience of reward-based learning in animals. 

Many excellent publications cover links between reinforcement learning and neuroscience, some of 
which we cite in this chapter’s final section. Our treatment differs from most of these because we assume 
familiarity with reinforcement learning as presented in the earlier chapters of this book, but we do not 
assume knowledge of neuroscience. We begin with a brief introduction to the neuroscience concepts 
needed for a basic understanding of what is to follow. 


15.1 Neuroscience Basics 

Some basic information about nervous systems is helpful for following what we cover in this chapter. 
Terms that we refer to later are italicized. Skipping this section will not be a problem if you already 
have an elementary knowledge of neuroscience. 

Neurons, the main components of nervous systems, are cells specialized for processing and transmitting 
information using electrical and chemical signals. They come in many forms, but a neuron typically 
has a cell body, dendrites, and a single axon. Dendrites are structures that branch from the cell body 
to receive input from other neurons (or to also receive external signals in the case of sensory neurons). 
A neuron’s axon is a fiber that carries the neuron’s output to other neurons (or to muscles or glands). 
A neuron’s output consists of sequences of electrical pulses called action potentials that travel along the 
axon. Action potentials are also called spikes, and a neuron is said to fire when it generates a spike. 
In models of neural networks it is common to use real numbers to represent a neuron’s firing rate, the 
average number of spikes per some unit of time. 
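
As a minimal sketch of how a firing rate reduces a spike train to a single real number, the function below counts spikes in a time window and divides by the window's duration. The function name, the half-open window convention, and the spike times are all hypothetical choices made for illustration.

```python
def firing_rate(spike_times, window_start, window_end):
    """Average number of spikes per unit time in [window_start, window_end)."""
    if window_end <= window_start:
        raise ValueError("window_end must be greater than window_start")
    count = sum(1 for t in spike_times if window_start <= t < window_end)
    return count / (window_end - window_start)


# Hypothetical spike times, in seconds.
spikes = [0.05, 0.30, 0.42, 0.90, 1.10, 1.75]

rate_full = firing_rate(spikes, 0.0, 2.0)    # 6 spikes over 2 s
rate_first = firing_rate(spikes, 0.0, 1.0)   # 4 spikes over 1 s
```

The choice of window matters: a short window tracks brief bursts of spiking, whereas a long window averages them away.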

A neuron’s axon can branch widely so that the neuron’s action potentials reach many targets. The 
branching structure of a neuron’s axon is called the neuron’s axonal arbor. Because the conduction 
of an action potential is an active process, not unlike the burning of a fuse, when an action potential 
reaches an axonal branch point it “lights up” action potentials on all of the outgoing branches (although 
propagation to a branch can sometimes fail). As a result, the activity of a neuron with a large axonal 
arbor can influence many target sites. 

A synapse is a structure generally at the termination of an axon branch that mediates the communication 
of one neuron to another. A synapse transmits information from the presynaptic neuron’s 
axon to a dendrite or cell body of the postsynaptic neuron. With a few exceptions, synapses release a 
chemical neurotransmitter upon the arrival of an action potential from the presynaptic neuron. (The 
exceptions are cases of direct electric coupling between neurons, but these will not concern us here.) 
Neurotransmitter molecules released from the presynaptic side of the synapse diffuse across the synaptic 
cleft, the very small space between the presynaptic ending and the postsynaptic neuron, and then bind 
to receptors on the surface of the postsynaptic neuron to excite or inhibit its spike-generating activity, 
or to modulate its behavior in other ways. A particular neurotransmitter may bind to several different 
types of receptors, with each producing a different effect on the postsynaptic neuron. For example, 
there are at least five different receptor types by which the neurotransmitter dopamine can affect a 
postsynaptic neuron. Many different chemicals have been identified as neurotransmitters in animal 
nervous systems. 

A neuron’s background activity is its level of activity, usually its firing rate, when the neuron does 
not appear to be driven by synaptic input related to the task of interest to the experimenter, for 
example, when the neuron’s activity is not correlated with a stimulus delivered to a subject as part 
of an experiment. Background activity can be irregular due to input from the wider network, or due 
to noise within the neuron or its synapses. Sometimes background activity is the result of dynamic 
processes intrinsic to the neuron. A neuron’s phasic activity, in contrast to its background activity, 
consists of bursts of spiking activity usually caused by synaptic input. Activity that varies slowly and 
often in a graded manner, whether as background activity or not, is called a neuron’s tonic activity. 
The strength or effectiveness by which the neurotransmitter released at a synapse influences the postsynaptic 
neuron is the synapse’s efficacy. One way a nervous system can change through experience is 
through changes in synaptic efficacies as a result of combinations of the activities of the presynaptic and 
postsynaptic neurons, and sometimes by the presence of a neuromodulator, which is a neurotransmitter 
having effects other than, or in addition to, direct fast excitation or inhibition. 

Brains contain several different neuromodulation systems consisting of clusters of neurons with widely 
branching axonal arbors, with each system using a different neurotransmitter. Neuromodulation can 
alter the function of neural circuits and mediate motivation, arousal, attention, memory, mood, emotion, 
sleep, and body temperature. Important here is that a neuromodulatory system can distribute something 
like a scalar signal, such as a reinforcement signal, to alter the operation of synapses in widely 
distributed sites critical for learning. 

The ability of synaptic efficacies to change is called synaptic plasticity. It is one of the primary mechanisms 
responsible for learning. The parameters, or weights, adjusted by learning algorithms correspond 
to synaptic efficacies. As we detail below, modulation of synaptic plasticity via the neuromodulator 
dopamine is a plausible mechanism for how the brain might implement learning algorithms like many 
of those described in this book. 


15.2 Reward Signals, Reinforcement Signals, Values, and Prediction Errors 

Links between neuroscience and computational reinforcement learning begin as parallels between signals 
in the brain and signals playing prominent roles in reinforcement learning theory and algorithms. In 
Chapter 3 we said that any problem of learning goal-directed behavior can be reduced to the three 
signals representing actions, states, and rewards. However, to explain links that have been made 
between neuroscience and reinforcement learning, we have to be less abstract than this and consider 
other reinforcement learning signals that correspond, in certain ways, to signals in the brain. In addition 
to reward signals, these include reinforcement signals (which we argue are different from reward signals), 
value signals, and signals conveying prediction errors. When we label a signal by its function in this 
way, we are doing it in the context of reinforcement learning theory in which the signal corresponds to 
a term in an equation or an algorithm. On the other hand, when we refer to a signal in the brain, we 
mean a physiological event such as a burst of action potentials or the secretion of a neurotransmitter. 
Labeling a neural signal by its function, for example calling the phasic activity of a dopamine neuron a 
reinforcement signal, means that the neural signal behaves like, and is conjectured to function like, the 
corresponding theoretical signal. 

Uncovering evidence for these correspondences involves many challenges. Neural activity related to 
reward processing can be found in nearly every part of the brain, and it is difficult to interpret results 
unambiguously because representations of different reward-related signals tend to be highly correlated 
with one another. Experiments need to be carefully designed to allow one type of reward-related signal 
to be distinguished with any degree of certainty from others—or from an abundance of other signals not 
related to reward processing. Despite these difficulties, many experiments have been conducted with 
the aim of reconciling aspects of reinforcement learning theory and algorithms with neural signals, and 
some compelling links have been established. To prepare for examining these links, in the rest of this 
section we remind the reader of what various reward-related signals mean according to reinforcement 
learning theory. 

In our Comments on Terminology at the end of the previous chapter, we said that $R_t$ is like a reward 
signal in an animal’s brain and not like an object or event in the animal’s environment. In reinforcement 
learning, the reward signal (along with an agent’s environment) defines the problem a reinforcement 
learning agent is trying to solve. In this respect, $R_t$ is like a signal in an animal’s brain that distributes 
primary reward to sites throughout the brain. But it is unlikely that a unitary master reward signal 
like $R_t$ exists in an animal’s brain. It is best to think of $R_t$ as an abstraction summarizing the overall 
effect of a multitude of neural signals generated by many systems in the brain that assess the rewarding 
or punishing qualities of sensations and states. 

Reinforcement signals in reinforcement learning are different from reward signals. The function of 
a reinforcement signal is to direct the changes a learning algorithm makes in an agent’s policy, value 
estimates, or environment models. For a TD method, for instance, the reinforcement signal at time t 
is the TD error $\delta_{t-1} = R_t + \gamma V(S_t) - V(S_{t-1})$.¹ The reinforcement signal for some algorithms could 
be just the reward signal, but for most of the algorithms we consider the reinforcement signal is the 
reward signal adjusted by other information, such as the value estimates in TD errors. 

Estimates of state values or of action values, that is, V or Q, specify what is good or bad for the agent 
over the long run. They are predictions of the total reward an agent can expect to accumulate over the 
future. Agents make good decisions by selecting actions leading to states with the largest estimated 
state values, or by selecting actions with the largest estimated action values. 

Prediction errors measure discrepancies between expected and actual signals or sensations. Reward 
prediction errors (RPEs) specifically measure discrepancies between the expected and the received reward 
signal, being positive when the reward signal is greater than expected, and negative when it is less than expected. 
TD errors like (6.5) are special kinds of RPEs that signal discrepancies between current and earlier expectations 
of reward over the long term. When neuroscientists refer to RPEs they generally (though not 
always) mean TD RPEs, which we simply call TD errors throughout this chapter. Also in this chapter, 
a TD error is generally one that does not depend on actions, as opposed to TD errors used in learning 
action values by algorithms like Sarsa and Q-learning. This is because the most well-known links to 
neuroscience are stated in terms of action-free TD errors, but we do not mean to rule out possible similar 
links involving action-dependent TD errors. (TD errors for predicting signals other than rewards 
are useful too, but that case will not concern us here. See, for example, Modayil, White, and Sutton, 
2014.) 

One can ask many questions about links between neuroscience data and these theoretically-defined 
signals. Is an observed signal more like a reward signal, a value signal, a prediction error, a reinforcement 
signal, or something altogether different? And if it is an error signal, is it an RPE, a TD error, or a 
simpler error like the Rescorla-Wagner error (14.3)? And if it is a TD error, does it depend on actions like 
the TD error of Q-learning or Sarsa? As indicated above, probing the brain to answer questions like these 
is extremely difficult. But experimental evidence suggests that one neurotransmitter, specifically the 
neurotransmitter dopamine, signals RPEs, and further, that the phasic activity of dopamine-producing 
neurons in fact conveys TD errors (see Section 15.1 for a definition of phasic activity). This evidence 
led to the reward prediction error hypothesis of dopamine neuron activity, which we describe next. 


15.3 The Reward Prediction Error Hypothesis 

The reward prediction error hypothesis of dopamine neuron activity proposes that one of the functions of 
the phasic activity of dopamine-producing neurons in mammals is to deliver an error between an old and 
a new estimate of expected future reward to target areas throughout the brain. This hypothesis (though 
not in these exact words) was first explicitly stated by Montague, Dayan, and Sejnowski (1996), who 
showed how the TD error concept from reinforcement learning accounts for many features of the phasic 
activity of dopamine neurons in mammals. The experiments that led to this hypothesis were performed 
in the 1980s and early 1990s in the laboratory of neuroscientist Wolfram Schultz. Section 15.5 describes 

1 As we mentioned in Section 6.1, $\delta_t$ in our notation is defined to be $R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$, so $\delta_t$ is not available 
until time $t+1$. The TD error available at $t$ is actually $\delta_{t-1} = R_t + \gamma V(S_t) - V(S_{t-1})$. Since we are thinking of time 
steps as very small, or even infinitesimal, time intervals, one should not attribute undue importance to this one-step time 
shift. 
these influential experiments, Section 15.6 explains how the results of these experiments align with TD 
errors, and the Bibliographical and Historical Remarks section at the end of this chapter includes a 
guide to the literature surrounding the development of this influential hypothesis. 

Montague et al. (1996) compared the TD errors of the TD model of classical conditioning with the 
phasic activity of dopamine-producing neurons during classical conditioning experiments. Recall from 
Section 14.2 that the TD model of classical conditioning is basically the semi-gradient-descent TD($\lambda$) 
algorithm with linear function approximation. Montague et al. made several assumptions to set up this 
comparison. First, since a TD error can be negative but neurons cannot have a negative firing rate, 
they assumed that the quantity corresponding to dopamine neuron activity is $\delta_{t-1} + b_t$, where $b_t$ is 
the background firing rate of the neuron. A negative TD error corresponds to a drop in a dopamine 
neuron’s firing rate below its background rate. 2 
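The first assumption can be sketched in a few lines of Python. This is only an illustration: the function name, the numerical values, and the clipping at zero are our own assumptions for the sketch, not details taken from Montague et al.

```python
def modeled_firing_rate(td_error, background_rate):
    """Quantity assumed to correspond to dopamine neuron activity.

    Implements delta_{t-1} + b_t. The clipping at zero is an assumption of
    this sketch, added only because a firing rate cannot be negative.
    """
    return max(0.0, td_error + background_rate)


# Hypothetical background rate (arbitrary units), chosen for illustration.
b = 5.0
at_baseline = modeled_firing_rate(0.0, b)    # zero TD error: background rate
burst = modeled_firing_rate(2.0, b)          # positive TD error: above background
pause = modeled_firing_rate(-3.0, b)         # negative TD error: below background
floor = modeled_firing_rate(-8.0, b)         # very negative TD error: clipped at zero
```

A negative TD error therefore shows up as a dip below the background rate rather than as a negative rate.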

A second assumption was needed about the states visited in each classical conditioning trial and 
how they are represented as inputs to the learning algorithm. This is the same issue we discussed in 
Section 14.2.4 for the TD model. Montague et al. chose a complete serial compound (CSC) representation 
as shown in the left column of Figure 14.2, but where the sequence of short-duration internal 
signals continues until the onset of the US, which here is the arrival of a non-zero reward signal. This 
representation allows the TD error to mimic the fact that dopamine neuron activity not only predicts 
a future reward, but that it is also sensitive to when after a predictive cue that reward is expected 
to arrive. There has to be some way to keep track of the time between sensory cues and the arrival 
of reward. If a stimulus initiates a sequence of internal signals that continues after the stimulus ends, 
and if there is a different signal for each time step following the stimulus, then each time step after the 
stimulus is represented by a distinct state. Thus, the TD error, being state-dependent, can be sensitive 
to the timing of events within a trial. 
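The CSC idea can be sketched in Python. This is a minimal illustration under our own assumptions: the names csc_features and linear_value, the discrete time index, and the fixed number of internal signals are hypothetical and are not taken from Montague et al. Each time step after cue onset activates exactly one internal signal, so a linear value estimate reduces to one weight per time step.

```python
def csc_features(t, cue_onset, n_signals):
    """Return a one-hot CSC feature vector for time step t (hypothetical helper).

    Component k is 1.0 exactly when t is k steps after cue onset, so every
    time step following the cue is represented by a distinct state.
    """
    features = [0.0] * n_signals
    k = t - cue_onset
    if 0 <= k < n_signals:
        features[k] = 1.0
    return features


def linear_value(weights, features):
    """Linear value estimate: the inner product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))


# Illustrative weights (made-up numbers): one weight per internal signal.
weights = [0.2, 0.5, 0.9]
x = csc_features(t=4, cue_onset=3, n_signals=3)  # one step after cue onset
value = linear_value(weights, x)                 # picks out weights[1]
```

Because the feature vector is one-hot, the value estimate, and hence the TD error, can differ from one time step to the next within a trial, which is what lets the TD error be sensitive to timing.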

In simulated trials with these assumptions about background firing rate and input representation, 
TD errors of the TD model are remarkably similar to dopamine neuron phasic activity. Previewing our 
description of details about these similarities in Section 15.5 below, the TD errors parallel the following 
features of dopamine neuron activity: 1) the phasic response of a dopamine neuron only occurs when 
a rewarding event is unpredicted; 2) early in learning, neutral cues that precede a reward do not cause 
substantial phasic dopamine responses, but with continued learning these cues gain predictive value 
and come to elicit phasic dopamine responses; 3) if an even earlier cue reliably precedes a cue that has 
already acquired predictive value, the phasic dopamine response shifts to the earlier cue, ceasing for 
the later cue; and 4) if after learning, the predicted rewarding event is omitted, a dopamine neuron’s 
response decreases below its baseline level shortly after the expected time of the rewarding event. 

Although not every dopamine neuron monitored in the experiments of Schultz and colleagues behaved 
in all of these ways, the striking correspondence between the activities of most of the monitored neurons 
and TD errors lends strong support to the reward prediction error hypothesis. There are situations, 
however, in which predictions based on the hypothesis do not match what is observed in experiments. 
The choice of input representation is critical to how closely TD errors match some of the details of 
dopamine neuron activity, particularly details about the timing of dopamine neuron responses. Different 
ideas, some of which we discuss below, have been proposed about input representations and other 
features of TD learning to make the TD errors fit the data better, though the main parallels appear 
with the CSC representation that Montague et al. used. Overall, the reward prediction error hypothesis 
has received wide acceptance among neuroscientists studying reward-based learning, and it has proven 
to be remarkably resilient in the face of accumulating results from neuroscience experiments. 

To prepare for our description of the neuroscience experiments supporting the reward prediction error 
hypothesis, and to provide some context so that the significance of the hypothesis can be appreciated, 
we next present some of what is known about dopamine, the brain structures it influences, and how it 
is involved in reward-based learning. 

2 In the literature relating TD errors to the activity of dopamine neurons, their $\delta_t$ is the same as our $\delta_{t-1} = R_t + \gamma V(S_t) - V(S_{t-1})$. 


15.4 Dopamine 

Dopamine is produced as a neurotransmitter by neurons whose cell bodies lie mainly in two clusters 
of neurons in the midbrain of mammals: the substantia nigra pars compacta (SNpc) and the ventral 
tegmental area (VTA). Dopamine plays essential roles in many processes in the mammalian brain. 
Prominent among these are motivation, learning, action-selection, most forms of addiction, and the 
disorders schizophrenia and Parkinson’s disease. Dopamine is called a neuromodulator because it performs 
many functions other than direct fast excitation or inhibition of targeted neurons. Although 
much remains unknown about dopamine’s functions and details of its cellular effects, it is clear that it 
is fundamental to reward processing in the mammalian brain. Dopamine is not the only neuromodulator 
involved in reward processing, and its role in aversive situations—punishment—remains controversial. 
Dopamine also can function differently in non-mammals. But no one doubts that dopamine is essential 
for reward-related processes in mammals, including humans. 

An early, traditional view is that dopamine neurons broadcast a reward signal to multiple brain 
regions implicated in learning and motivation. This view followed from a famous 1954 paper by James 
Olds and Peter Milner that described the effects of electrical stimulation on certain areas of a rat’s 
brain. They found that electrical stimulation to particular regions acted as a very powerful reward in 
controlling the rat’s behavior: “... the control exercised over the animal’s behavior by means of this 
reward is extreme, possibly exceeding that exercised by any other reward previously used in animal 
experimentation” (Olds and Milner, 1954). Later research revealed that the sites at which stimulation 
was most effective in producing this rewarding effect excited dopamine pathways, either directly or 
indirectly, that ordinarily are excited by natural rewarding stimuli. Effects similar to these with rats 
were also observed with human subjects. These observations strongly suggested that dopamine neuron 
activity signals reward. 

But if the reward prediction error hypothesis is correct—even if it accounts for only some features 
of a dopamine neuron’s activity—this traditional view of dopamine neuron activity is not entirely 
correct: phasic responses of dopamine neurons signal reward prediction errors, not reward itself. In 
reinforcement learning’s terms, a dopamine neuron’s phasic response at a time $t$ corresponds to 
$\delta_{t-1} = R_t + \gamma V(S_t) - V(S_{t-1})$, not to $R_t$. 

Reinforcement learning theory and algorithms help reconcile the reward-prediction-error view with 
the conventional notion that dopamine signals reward. In many of the algorithms we discuss in this 
book, $\delta$ functions as a reinforcement signal, meaning that it is the main driver of learning. For example, 
$\delta$ is the critical factor in the TD model of classical conditioning, and $\delta$ is the reinforcement signal for 
learning both a value function and a policy in an actor-critic architecture (Sections 13.5 and 15.7). 
Action-dependent forms of $\delta$ are reinforcement signals for Q-learning and Sarsa. The reward signal $R_t$ 
is a crucial component of $\delta_{t-1}$, but it is not the complete determinant of its reinforcing effect in these 
algorithms. The additional term $\gamma V(S_t) - V(S_{t-1})$ is the higher-order reinforcement part of $\delta_{t-1}$, and 
even if reward occurs ($R_t \neq 0$), the TD error can be silent if the reward is fully predicted (which is fully 
explained in Section 15.6 below). 
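
The decomposition can be written out as a short worked equation. It uses only the symbols already defined in this section; the labels under the braces and the phrase "fully predicted" made precise as $V(S_{t-1}) = R_t + \gamma V(S_t)$ are our own reading, offered as a sketch rather than a definition.

```latex
\begin{align*}
\delta_{t-1} &= \underbrace{R_t}_{\text{reward}}
  + \underbrace{\gamma V(S_t) - V(S_{t-1})}_{\text{higher-order reinforcement}} \\
\intertext{If the reward is fully predicted, i.e., $V(S_{t-1}) = R_t + \gamma V(S_t)$, then}
\delta_{t-1} &= R_t + \gamma V(S_t) - \bigl(R_t + \gamma V(S_t)\bigr) = 0
  \quad \text{even though } R_t \neq 0 .
\end{align*}
```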

A closer look at Olds and Milner’s 1954 paper, in fact, reveals that it is mainly about the reinforcing 
effect of electrical stimulation in an instrumental conditioning task. Electrical stimulation not only 
energized the rats’ behavior (through dopamine’s effect on motivation), it also led to the rats quickly 
learning to stimulate themselves by pressing a lever, which they would do frequently for long periods 
of time. The activity of dopamine neurons triggered by electrical stimulation reinforced the rats’ lever 
pressing. 

More recent experiments using optogenetic methods clinch the role of phasic responses of dopamine 
neurons as reinforcement signals. These methods allow neuroscientists to precisely control the activity 
of selected neuron types at a millisecond timescale in awake behaving animals. Optogenetic methods 
introduce light-sensitive proteins into selected neuron types so that these neurons can be activated 
or silenced by means of flashes of laser light. The first experiment using optogenetic methods to 
study dopamine neurons showed that optogenetic stimulation producing phasic activation of dopamine 
neurons in mice was enough to condition the mice to prefer the side of a chamber where they received 
this stimulation as compared to the chamber’s other side where they received no, or lower-frequency, 
stimulation (Tsai et al. 2009). In another example, Steinberg et al. (2013) used optogenetic activation 
of dopamine neurons to create artificial bursts of dopamine neuron activity in rats at the times when 
rewarding stimuli were expected but omitted—times when dopamine neuron activity normally pauses. 
With these pauses replaced by artificial bursts, responding was sustained when it would ordinarily 
decrease due to lack of reinforcement (in extinction trials), and learning was enabled when it would 
ordinarily be blocked due to the reward being already predicted (the blocking paradigm; Section 14.2.1). 

Additional evidence for the reinforcing function of dopamine comes from optogenetic experiments 
with fruit flies, except in these animals dopamine’s effect is the opposite of its effect in mammals: 
optically triggered bursts of dopamine neuron activity act just like electric foot shock in reinforcing 
avoidance behavior, at least for the population of dopamine neurons activated (Claridge-Chang et al. 
2009). Although none of these optogenetic experiments showed that phasic dopamine neuron activity 
is specifically like a TD error, they convincingly demonstrated that phasic dopamine neuron activity 
acts just like $\delta$ acts (or perhaps like $-\delta$ acts in fruit flies) as the reinforcement signal in algorithms 
for both prediction (classical conditioning) and control (instrumental conditioning). 

Dopamine neurons are particularly well suited to broadcasting a reinforcement signal to many areas 
of the brain. These neurons have huge axonal arbors, each releasing dopamine at 100 to 1,000 times 
more synaptic sites than reached by the axons of typical neurons. Figure 15.1 shows the axonal arbor 
of a single dopamine neuron whose cell body is in the SNpc of a rat’s brain. Each axon of a SNpc or 
VTA dopamine neuron makes roughly 500,000 synaptic contacts on the dendrites of neurons in targeted 
brain areas. 



Figure 15.1: Axonal arbor of a single neuron producing dopamine as a neurotransmitter whose cell body is in 
the SNpc of a rat’s brain. These axons make synaptic contacts with a huge number of dendrites of neurons in 
targeted brain areas. Adapted from Journal of Neuroscience , Matsuda, Furuta, Nakamura, Hioki, Fujiyama, 
Arai, and Kaneko, volume 29, 2009, page 451. 

If dopamine neurons broadcast a reinforcement signal like reinforcement learning’s $\delta$, then since this 
is a scalar signal, i.e., a single number, all dopamine neurons in both the SNpc and VTA would be 
expected to activate more-or-less identically so that they would act in near synchrony to send the same 
signal to all of the sites their axons target. Although it has been a common belief that dopamine neurons 
do act together like this, modern evidence is pointing to the more complicated picture that different 
subpopulations of dopamine neurons respond to input differently depending on the structures to which 
they send their signals and the different ways these signals act on their target structures. Dopamine 
has functions other than signaling RPEs, and even for dopamine neurons that do signal RPEs, it can 
make sense to send different RPEs to different structures depending on the roles these structures play 
in producing reinforced behavior. This is beyond what we treat in any detail in this book, but vector-valued 
RPE signals make sense from the perspective of reinforcement learning when decisions can be 
decomposed into separate sub-decisions, or more generally, as a way to address the structural version 
of the credit assignment problem: How do you distribute credit for success (or blame for failure) of a 
decision among the many component structures that could have been involved in producing it? We say 
a bit more about this in Section 15.10 below. 

The axons of most dopamine neurons make synaptic contact with neurons in the frontal cortex and 
the basal ganglia, areas of the brain involved in voluntary movement, decision making, learning, and 
cognitive functions such as planning. Since most ideas relating dopamine to reinforcement learning focus 
on the basal ganglia, and the connections from dopamine neurons are particularly dense there, we focus 
on the basal ganglia here. The basal ganglia are a collection of neuron groups, or nuclei, lying at the base 
of the forebrain. The main input structure of the basal ganglia is called the striatum. Essentially all 
of the cerebral cortex, among other structures, provides input to the striatum. The activity of cortical 
neurons conveys a wealth of information about sensory input, internal states, and motor activity. The 
axons of cortical neurons make synaptic contacts on the dendrites of the main input/output neurons 
of the striatum, called medium spiny neurons. Output from the striatum loops back via other basal 
ganglia nuclei and the thalamus to frontal areas of cortex, and to motor areas, making it possible for 
the striatum to influence movement, abstract decision processes, and reward processing. Two main 
subdivisions of the striatum are important for reinforcement learning: the dorsal striatum, primarily 
implicated in influencing action selection, and the ventral striatum, thought to be critical for different 
aspects of reward processing, including the assignment of affective value to sensations. 

The dendrites of medium spiny neurons are covered with spines on whose tips the axons of neurons 
in the cortex make synaptic contact. Also making synaptic contact with these spines—in this case 
contacting the spine stems—are axons of dopamine neurons (Figure 15.2). This arrangement brings 
together presynaptic activity of cortical neurons, postsynaptic activity of medium spiny neurons, and 
input from dopamine neurons. What actually occurs at these spines is complex and not completely 
understood. Figure 15.2 hints at the complexity by showing two types of receptors for dopamine, 
receptors for glutamate—the neurotransmitter of the cortical inputs—and multiple ways that the various 
signals can interact. But evidence is mounting that changes in the efficacies of the synapses on the 
pathway from the cortex to the striatum, which neuroscientists call corticostriatal synapses , depend 
critically on appropriately-timed dopamine signals. 

Figure 15.2: Spine of a striatal neuron showing input from both cortical and dopamine neurons. Axons of 
cortical neurons influence striatal neurons via corticostriatal synapses releasing the neurotransmitter glutamate 
at the tips of spines covering the dendrites of striatal neurons. An axon of a VTA or SNpc dopamine neuron is 
shown passing by the spine (from the lower right). “Dopamine varicosities” on this axon release dopamine at or 
near the spine stem, in an arrangement that brings together presynaptic input from cortex, postsynaptic activity 
of the striatal neuron, and dopamine, making it possible that several types of learning rules govern the plasticity 
of corticostriatal synapses. Each axon of a dopamine neuron makes synaptic contact with the stems of roughly 
500,000 spines. Some of the complexity omitted from our discussion is shown here by other neurotransmitter 
pathways and multiple receptor types, such as D1 and D2 dopamine receptors by which dopamine can produce 
different effects at spines and other postsynaptic sites. From Journal of Neurophysiology , W. Schultz, vol. 80, 
1998, page 10. 


15.5 Experimental Support for the Reward Prediction Error Hypothesis 

Dopamine neurons respond with bursts of activity to intense, novel, or unexpected visual and auditory 
stimuli that trigger eye and body movements, but very little of their activity is related to the movements 
themselves. This is surprising because degeneration of dopamine neurons is a cause of Parkinson’s 
disease, whose symptoms include motor disorders, particularly deficits in self-initiated movement. Motivated 
by the weak relationship between dopamine neuron activity and stimulus-triggered eye and body 
movements, Romo and Schultz (1990) and Schultz and Romo (1990) took the first steps toward the 
reward prediction error hypothesis by recording the activity of dopamine neurons and muscle activity 
while monkeys moved their arms. 

They trained two monkeys to reach from a resting hand position into a bin containing a bit of apple, 
a piece of cookie, or a raisin, when the monkey saw and heard the bin’s door open. The monkey could 
then grab and bring the food to its mouth. After a monkey became good at this, it was trained on two 
additional tasks. The purpose of the first task was to see what dopamine neurons do when movements 
are self-initiated. The bin was left open but covered from above so that the monkey could not see 
inside but could reach in from below. No triggering stimuli were presented, and after the monkey 
reached for and ate the food morsel, the experimenter usually (though not always), silently and unseen 
by the monkey, replaced food in the bin by sticking it onto a rigid wire. Here too, the activity of the 
dopamine neurons Romo and Schultz monitored was not related to the monkey’s movements, but a 
large percentage of these neurons produced phasic responses whenever the monkey first touched a food 
morsel. These neurons did not respond when the monkey touched just the wire or explored the bin 
when no food was there. This was good evidence that the neurons were responding to the food and not 
to other aspects of the task. 

The purpose of Romo and Schultz’s second task was to see what happens when movements are 
triggered by stimuli. This task used a different bin with a movable cover. The sight and sound of 
the bin opening triggered reaching movements to the bin. In this case, Romo and Schultz found that 
after some period of training, the dopamine neurons no longer responded to the touch of the food but 
instead responded to the sight and sound of the opening cover of the food bin. The phasic responses 
of these neurons had shifted from the reward itself to stimuli predicting the availability of the reward. 
In a followup study, Romo and Schultz found that most of the dopamine neurons whose activity they 
monitored did not respond to the sight and sound of the bin opening outside the context of the behavioral 
task. These observations suggested that the dopamine neurons were responding neither to the initiation 
of a movement nor to the sensory properties of the stimuli, but were rather signaling an expectation of 
reward. 

Schultz’s group conducted many additional studies involving both SNpc and VTA dopamine neurons. 
A particular series of experiments was influential in suggesting that the phasic responses of dopamine 
neurons correspond to TD errors and not to simpler errors like those in the Rescorla-Wagner model 
(14.3). In the first of these experiments (Ljungberg, Apicella, and Schultz, 1992), monkeys were trained 
to depress a lever after a light was illuminated as a ‘trigger cue’ to obtain a drop of apple juice. As 
Romo and Schultz had observed earlier, many dopamine neurons initially responded to the reward—the 
drop of juice (Figure 15.3, top panel). But many of these neurons lost that reward response as training 
continued and developed responses instead to the illumination of the light that predicted the reward 
(Figure 15.3, middle panel). With continued training, lever pressing became faster while the number of 
dopamine neurons responding to the trigger cue decreased. 

Figure 15.3: The response of dopamine neurons shifts from initial responses to primary reward to earlier 
predictive stimuli. These are plots of the number of action potentials produced by monitored dopamine neurons 
within small time intervals, averaged over all the monitored dopamine neurons (ranging from 23 to 44 neurons 
for these data). Top: dopamine neurons are activated by the unpredicted delivery of a drop of apple juice. 
Middle: with learning, dopamine neurons developed responses to the reward-predicting trigger cue and lost 
responsiveness to the delivery of reward. Bottom: with the addition of an instruction cue preceding the trigger 
cue by 1 second, dopamine neurons shifted their responses from the trigger cue to the earlier instruction cue. 
From Schultz et al. (1995), MIT Press. 

Following this study, the same monkeys were trained on a new task (Schultz, Apicella, and Ljungberg, 
1993). Here the monkeys faced two levers, each with a light above it. Illuminating one of these lights 
was an ‘instruction cue’ indicating which of the two levers would produce a drop of apple juice. In this 
task, the instruction cue preceded the trigger cue of the previous task by a fixed interval of 1 second. 
The monkeys learned to withhold reaching until seeing the trigger cue, and dopamine neuron activity 
increased, but now the responses of the monitored dopamine neurons occurred almost exclusively to the 
earlier instruction cue and not to the trigger cue (Figure 15.3, bottom panel). Here again the number of 
dopamine neurons responding to the instruction cue was much reduced when the task was well learned. 
During learning across these tasks, dopamine neuron activity shifted from initially responding to the 
reward to responding to the earlier predictive stimuli, first progressing to the trigger stimulus then 
to the still earlier instruction cue. As responding moved earlier in time it disappeared from the later 
stimuli. This shifting of responses to earlier reward predictors, while losing responses to later predictors, 
is a hallmark of TD learning (see, for example, Figure 14.5). 

The task just described revealed another property of dopamine neuron activity shared with TD 
learning. The monkeys sometimes pressed the wrong key, that is, the key other than the instructed 
one, and consequently received no reward. In these trials, many of the dopamine neurons showed a 
sharp decrease in their firing rates below baseline shortly after the reward’s usual time of delivery, 
and this happened without the availability of any external cue to mark the usual time of reward 
delivery (Figure 15.4). Somehow the monkeys were internally keeping track of the timing of the reward. 
(Response timing is one area where the simplest version of TD learning needs to be modified to account 
for some of the details of the timing of dopamine neuron responses. We consider this issue in the 
following section.) 

The observations from the studies described above led Schultz and his group to conclude that 
dopamine neurons respond to unpredicted rewards, to the earliest predictors of reward, and that 
dopamine neuron activity decreases below baseline if a reward, or a predictor of reward, does not 
occur at its expected time. Researchers familiar with reinforcement learning were quick to recognize 
that these results are strikingly similar to how the TD error behaves as the reinforcement signal in a TD 
algorithm. The next section explores this similarity by working through a specific example in detail. 


15.6 TD Error/Dopamine Correspondence 

This section explains the correspondence between the TD error $\delta$ and the phasic responses of dopamine 
neurons observed in the experiments just described. We examine how $\delta$ changes over the course of 
learning in a task something like the one described above where a monkey first sees an instruction cue 
and then a fixed time later has to respond correctly to a trigger cue in order to obtain reward. We use 
a simple idealized version of this task, but we go into a lot more detail than is usual because we want 
to emphasize the theoretical basis of the parallel between TD errors and dopamine neuron activity. 

The first simplifying assumption is that the agent has already learned the actions required to obtain 
reward. Then its task is just to learn accurate predictions of future reward for the sequence of states it 
experiences. This is then a prediction task, or more technically, a policy-evaluation task: learning the 
value function for a fixed policy (Sections 4.1 and 6.1). The value function to be learned assigns to each 
state a value that predicts the return that will follow that state if the agent selects actions according 
to the given policy, where the return is the (possibly discounted) sum of all the future rewards. This is 
unrealistic as a model of the monkey’s situation because the monkey would likely learn these predictions 
at the same time that it is learning to act correctly (as would a reinforcement learning algorithm that 
learns policies as well as value functions, such as an actor-critic algorithm), but this scenario is simpler 
to describe than one in which a policy and a value function are learned simultaneously. 

Figure 15.4: The response of dopamine neurons drops below baseline shortly after the time when an expected 
reward fails to occur. Top: dopamine neurons are activated by the unpredicted delivery of a drop of apple juice. 
Middle: dopamine neurons respond to a conditioned stimulus (CS) that predicts reward and do not respond 
to the reward itself. Bottom: when the reward predicted by the CS fails to occur, the activity of dopamine 
neurons drops below baseline shortly after the time the reward is expected to occur. At the top of each of these 
panels is shown the average number of action potentials produced by monitored dopamine neurons within small 
time intervals around the indicated times. The raster plots below show the activity patterns of the individual 
dopamine neurons that were monitored; each dot represents an action potential. From Schultz, Dayan, and 
Montague, A Neural Substrate of Prediction and Reward, Science, vol. 275, issue 5306, pages 1593-1598, March 
14, 1997. Reprinted with permission from AAAS. 

Now imagine that the agent’s experience divides into multiple trials, in each of which the same 
sequence of states repeats, with a distinct state occurring on each time step during the trial. Further 
imagine that the return being predicted is limited to the return over a trial, which makes a trial 
analogous to a reinforcement learning episode as we have defined it. In reality, of course, the returns 
being predicted are not confined to single trials, and the time interval between trials is an important 
factor in determining what an animal learns. This is true for TD learning as well, but here we assume 
that returns do not accumulate over multiple trials. Given this, then, a trial in experiments like those 
conducted by Schultz and colleagues is equivalent to an episode of reinforcement learning. (Though in 
this discussion, we will use the term trial instead of episode to relate better to the experiments.) 

As usual, we also need to make an assumption about how states are represented as inputs to the 
learning algorithm, an assumption that influences how closely the TD error corresponds to dopamine 
neuron activity. We discuss this issue later, but for now we assume the same CSC representation used 
by Montague et al. (1996) in which there is a separate internal stimulus for each state visited at each 
time step in a trial. This reduces the process to the tabular case covered in the first part of this book. 
Finally, we assume that the agent uses TD(0) to learn a value function, $V$, stored in a lookup table 
initialized to be zero for all the states. We also assume that this is a deterministic task and that the 
discount factor, $\gamma$, is very nearly one so that we can ignore it. 

Figure 15.5 shows the time courses of $R$, $V$, and $\delta$ at several stages of learning in this policy-evaluation 
task. The time axes represent the time interval over which a sequence of states is visited in a trial (where 
for clarity we omit showing individual states). The reward signal is zero throughout each trial except 
when the agent reaches the rewarding state, shown near the right end of the time line, when the reward 
signal becomes some positive number, say R*. The goal of TD learning is to predict the return for each 
state visited in a trial, which in this undiscounted case and given our assumption that predictions are 
confined to individual trials, is simply R* for each state. 

Preceding the rewarding state is a sequence of reward-predicting states, with the earliest reward- 
predicting state shown near the left end of the time line. This is like the state near the start of a trial, 
for example like the state marked by the instruction cue in a trial of the monkey experiment of Schultz 
et al. (1993) described above. It is the first state in a trial that reliably predicts that trial’s reward. 
(Of course, in reality states visited on preceding trials are even earlier reward-predicting states, but 
since we are confining predictions to individual trials, these do not qualify as predictors of this trial’s 
reward. Below we give a more satisfactory, though more abstract, description of an earliest reward- 
predicting state.) The latest reward-predicting state in a trial is the state immediately preceding the 
trial’s rewarding state. This is the state near the far right end of the time line in Figure 15.5. Note that 
the rewarding state of a trial does not predict the return for that trial: the value of this state would 
come to predict the return over all the following trials, which here we are assuming to be zero in this 
episodic formulation. 

Figure 15.5: The behavior of the TD error $\delta$ during TD learning is consistent with features of the phasic 
activation of dopamine neurons. (Here $\delta$ is the TD error available at time $t$, i.e., $\delta_{t-1}$.) Top: a sequence of 
states, shown as an interval of regular predictors, is followed by a non-zero reward $R^*$. Early in learning: the 
initial value function, $V$, and initial $\delta$, which at first is equal to $R^*$. Learning complete: the value function 
accurately predicts future reward, $\delta$ is positive at the earliest predictive state, and $\delta = 0$ at the time of the 
non-zero reward. $R^*$ omitted: at the time the predicted reward is omitted, $\delta$ becomes negative. See text for a 
complete explanation of why this happens. 

Figure 15.5 shows the first-trial time courses of $V$ and $\delta$ as the graphs labeled ‘early in learning.’ 
Because the reward signal is zero throughout the trial except when the rewarding state is reached, and 
all the $V$-values are zero, the TD error is also zero until it becomes $R^*$ at the rewarding state. This 
follows because $\delta_{t-1} = R_t + V_t - V_{t-1} = R_t + 0 - 0 = R_t$, which is zero until it equals $R^*$ when the 
reward occurs. Here $V_t$ and $V_{t-1}$ are respectively the estimated values of the states visited at times $t$ 
and $t-1$ in a trial. The TD error at this stage of learning is analogous to a dopamine neuron responding 
to an unpredicted reward, e.g., a drop of apple juice, at the start of training. 

Throughout this first trial and all successive trials, TD(0) updates occur at each state transition as 
described in Chapter 6. This successively increases the values of the reward-predicting states, with the 
increases spreading backwards from the rewarding state, until the values converge to the correct return 
predictions. In this case (since we are assuming no discounting) the correct predictions are equal to R* 
for all the reward-predicting states. This can be seen in Figure 15.5 as the graph of V labeled ‘learning 
complete’ where the values of all the states from the earliest to the latest reward-predicting states all 
equal R*. The values of the states preceding the earliest reward-predicting state remain low (which 
Figure 15.5 shows as zero) because they are not reliable predictors of reward. 

When learning is complete, that is, when V attains its correct values, the TD errors associated with 
transitions from any reward-predicting state are zero because the predictions are now accurate. This 
is because for a transition from a reward-predicting state to another reward-predicting state, we have 
$\delta_{t-1} = R_t + V_t - V_{t-1} = 0 + R^* - R^* = 0$, and for the transition from the latest reward-predicting state to
the rewarding state, we have $\delta_{t-1} = R_t + V_t - V_{t-1} = R^* + 0 - R^* = 0$. On the other hand, the TD error
on a transition from any state to the earliest reward-predicting state is positive because of the mismatch
between this state’s low value and the larger value of the following reward-predicting state. Indeed, if
the value of a state preceding the earliest reward-predicting state were zero, then after the transition to
the earliest reward-predicting state, we would have that $\delta_{t-1} = R_t + V_t - V_{t-1} = 0 + R^* - 0 = R^*$. The
‘learning complete’ graph of $\delta$ in Figure 15.5 shows this positive value at the earliest reward-predicting
state, and zeros everywhere else. 

The positive TD error upon transitioning to the earliest reward-predicting state is analogous to 
the persistence of dopamine responses to the earliest stimuli predicting reward. By the same token, 
when learning is complete, a transition from the latest reward-predicting state to the rewarding state 
produces a zero TD error because the latest reward-predicting state’s value, being correct, cancels the 
reward. This parallels the observation that fewer dopamine neurons generate a phasic response to a 
fully predicted reward than to an unpredicted reward. 

After learning, if the reward is suddenly omitted, the TD error goes negative at the usual time of 
reward because the value of the latest reward-predicting state is then too high:
$\delta_{t-1} = R_t + V_t - V_{t-1} = 0 + 0 - R^* = -R^*$, as shown at the right end of the ‘R* omitted’ graph of $\delta$ in Figure 15.5. This is like
dopamine neuron activity decreasing below baseline at the time an expected reward is omitted as seen 
in the experiment of Schultz et al. (1993) described above and shown in Figure 15.4. 
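
The three time courses of the TD error described above can be reproduced in a few lines of code. The following is a minimal sketch, not the simulation behind Figure 15.5. It assumes a tabular, undiscounted TD(0) learner, a hypothetical trial with `M = 5` reward-predicting states, and a step size `ALPHA = 0.5`. It also clamps to zero the value of the state preceding the earliest reward-predicting state, standing in for the assumption that this state is not a reliable predictor. The names `run_trial`, `M`, `ALPHA`, and `R_STAR` are illustrative and do not come from the text.

```python
R_STAR = 1.0   # size of the reward R* (illustrative)
M = 5          # number of reward-predicting states per trial (assumption)
ALPHA = 0.5    # TD(0) step size (assumption)

def run_trial(V, reward, learn=True):
    """Run one trial and return the TD error of each transition.

    State 0 precedes the earliest reward-predicting state, states 1..M are
    the reward-predicting states, and state M+1 is the rewarding state.
    The values of states 0 and M+1 are never updated (held at zero).
    """
    deltas = []
    for s in range(M + 1):                      # transition s -> s+1
        r = reward if s + 1 == M + 1 else 0.0   # reward only on reaching state M+1
        delta = r + V[s + 1] - V[s]             # undiscounted TD error
        deltas.append(delta)
        if learn and 1 <= s <= M:               # only reward-predicting states learn
            V[s] += ALPHA * delta
    return deltas

V = [0.0] * (M + 2)
early = run_trial(V, R_STAR)                    # first trial: all values are zero
for _ in range(200):                            # further training trials
    run_trial(V, R_STAR)
complete = run_trial(V, R_STAR, learn=False)    # after learning, reward delivered
omitted = run_trial(V, 0.0, learn=False)        # after learning, reward omitted

print("early   :", [round(d, 3) for d in early])
print("complete:", [round(d, 3) for d in complete])
print("omitted :", [round(d, 3) for d in omitted])
```

Under these assumptions, `early` is non-zero only at the final transition (where it equals R*), `complete` is non-zero only at the transition into the earliest reward-predicting state, and `omitted` additionally shows a TD error of -R* at the usual time of reward.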

The idea of an earliest reward-predicting state deserves more attention. In the scenario described 
above, since experience is divided into trials, and we assumed that predictions are confined to individual
trials, the earliest reward-predicting state is always the first state of a trial. Clearly this is
artificial. A more general way to think of an earliest reward-predicting state is that it is an unpredicted 
predictor of reward, and there can be many such states. In an animal’s life, many different states may 
precede an earliest reward-predicting state. However, because these states are more often followed by 
other states that do not predict reward, their reward-predicting powers, that is, their values, remain 
low. A TD algorithm, if operating throughout the animal’s life, would update the values of these 
states too, but the updates would not consistently accumulate because, by assumption, none of these 
states reliably precedes an earliest reward-predicting state. If any of them did, they would be reward- 
predicting states as well. This might explain why with overtraining, dopamine responses decrease to 
even the earliest reward-predicting stimulus in a trial. With overtraining one would expect that even a 
formerly-unpredicted predictor state would become predicted by stimuli associated with earlier states: 
the animal’s interaction with its environment both inside and outside of an experimental task would
become commonplace. Upon breaking this routine with the introduction of a new task, however, one 
would see TD errors reappear, as indeed is observed in dopamine neuron activity. 

The example described above explains why the TD error shares key features with the phasic activity 
of dopamine neurons when the animal is learning in a task similar to the idealized task of our example. 
But not every property of the phasic activity of dopamine neurons coincides so neatly with properties 
of $\delta$. One of the most troubling discrepancies involves what happens when a reward occurs earlier than
expected. We have seen that the omission of an expected reward produces a negative prediction error 
at the reward’s expected time, which corresponds to the activity of dopamine neurons decreasing below 
baseline when this happens. If the reward arrives later than expected, it is then an unexpected reward 
and generates a positive prediction error. This happens with both TD errors and dopamine neuron 
responses. But when reward arrives earlier than expected, dopamine neurons do not do what the TD 
error does—at least with the CSC representation used by Montague et al. (1996) and by us in our 
example. Dopamine neurons do respond to the early reward, which is consistent with a positive TD 
error because the reward is not predicted to occur then. However, at the later time when the reward is 
expected but omitted, the TD error is negative whereas, in contrast to this prediction, dopamine neuron 
activity does not drop below baseline in the way the TD model predicts (Hollerman and Schultz, 1998). 
Something more complicated is going on in the animal’s brain than simply TD learning with a CSC 
representation. 

Some of the mismatches between the TD error and dopamine neuron activity can be addressed by 
selecting suitable parameter values for the TD algorithm and by using stimulus representations other 
than the CSC representation. For instance, to address the early-reward mismatch just described, Suri 
and Schultz (1999) proposed a CSC representation in which the sequences of internal signals initiated 
by earlier stimuli are cancelled by the occurrence of a reward. Another proposal by Daw, Courville, 
and Touretzky (2006) is that the brain’s TD system uses representations produced by statistical modeling
carried out in sensory cortex rather than simpler representations based on raw sensory input.
Ludvig, Sutton, and Kehoe (2008) found that TD learning with a microstimulus (MS) representation 
(Figure 14.2) fits the activity of dopamine neurons in the early-reward and other situations better than 
when a CSC representation is used. Pan, Schmidt, Wickens, and Hyland (2005) found that even with 
the CSC representation, prolonged eligibility traces improve the fit of the TD error to some aspects of 
dopamine neuron activity. In general, many fine details of TD-error behavior depend on subtle interactions
between eligibility traces, discounting, and stimulus representations. Findings like these elaborate
and refine the reward prediction error hypothesis without refuting its core claim that the phasic activity 
of dopamine neurons is well characterized as signaling TD errors. 

On the other hand, there are other discrepancies between the TD theory and experimental data that 
are not so easily accommodated by selecting parameter values and stimulus representations (we mention 
some of these discrepancies in the Bibliographical and Historical Remarks section at the end of this 
chapter), and more mismatches are likely to be discovered as neuroscientists conduct ever more refined 
experiments. But the reward prediction error hypothesis has been functioning very effectively as a 
catalyst for improving our understanding of how the brain’s reward system works. Intricate experiments 
have been designed to validate or refute predictions derived from the hypothesis, and experimental 
results have, in turn, led to refinement and elaboration of the TD error/dopamine hypothesis. 

A remarkable aspect of these developments is that the reinforcement learning algorithms and theory 
that connect so well with properties of the dopamine system were developed from a computational 
perspective in total absence of any knowledge about the relevant properties of dopamine neurons— 
remember, TD learning and its connections to optimal control and dynamic programming were developed
many years before any of the experiments were conducted that revealed the TD-like nature of
dopamine neuron activity. This unplanned correspondence, despite not being perfect, suggests that the 
TD error/dopamine parallel captures something significant about brain reward processes. 

In addition to accounting for many features of the phasic activity of dopamine neurons, the reward
prediction error hypothesis links neuroscience to other aspects of reinforcement learning, in particular,
to learning algorithms that use TD errors as reinforcement signals. Neuroscience is still far from
reaching complete understanding of the circuits, molecular mechanisms, and functions of the phasic
activity of dopamine neurons, but evidence supporting the reward prediction error hypothesis, along
with evidence that phasic dopamine responses are reinforcement signals for learning, suggests that the
brain might implement something like an actor-critic algorithm in which TD errors play critical roles.
Other reinforcement learning algorithms are plausible candidates too, but actor-critic algorithms fit 
the anatomy and physiology of the mammalian brain particularly well, as we describe in the following 
two sections. 


15.7 Neural Actor-Critic

Actor-critic algorithms learn both policies and value functions. The ‘actor’ is the component that learns 
policies, and the ‘critic’ is the component that learns about whatever policy is currently being followed 
by the actor in order to ‘criticize’ the actor’s action choices. The critic uses a TD algorithm to learn 
the state-value function for the actor’s current policy. The value function allows the critic to critique 
the actor’s action choices by sending TD errors, $\delta$, to the actor. A positive $\delta$ means that the action
was ‘good’ because it led to a state with a better-than-expected value; a negative $\delta$ means that the
action was ‘bad’ because it led to a state with a worse-than-expected value. Based on these critiques, 
the actor continually updates its policy. 

Two distinctive features of actor-critic algorithms are responsible for thinking that the brain might 
implement an algorithm like this. First, the two components of an actor-critic algorithm—the actor and 
the critic—suggest that two parts of the striatum—the dorsal and ventral subdivisions (Section 15.4), 
both critical for reward-based learning—may function respectively something like an actor and a critic. 
A second property of actor-critic algorithms that suggests a brain implementation is that the TD error 
has the dual role of being the reinforcement signal for both the actor and the critic, though it has a 
different influence on learning in each of these components. This fits well with several properties of 
the neural circuitry: axons of dopamine neurons target both the dorsal and ventral subdivisions of the 
striatum; dopamine appears to be critical for modulating synaptic plasticity in both structures; and 
how a neuromodulator such as dopamine acts on a target structure depends on properties of the target 
structure and not just on properties of the neuromodulator. 

Section 13.5 presents actor-critic algorithms as policy gradient methods, but the actor-critic algorithm
of Barto, Sutton, and Anderson (1983) was simpler and was presented as an artificial neural
network. Here we describe an artificial neural network implementation something like that of Barto
et al., and we follow Takahashi, Schoenbaum, and Niv (2008) in giving a schematic proposal for how
this artificial neural network might be implemented by real neural networks in the brain. We postpone 
discussion of the actor and critic learning rules until Section 15.8, where we present them as special cases 
of the policy-gradient formulation and discuss what they suggest about how dopamine might modulate 
synaptic plasticity. 

Figure 15.6a shows an implementation of an actor-critic algorithm as an artificial neural network 
with component networks implementing the actor and the critic. The critic consists of a single neuron- 
like unit, V, whose output activity represents state values, and a component shown as the diamond 
labeled TD that computes TD errors by combining V’s output with reward signals and with previous 
state values (as suggested by the loop from the TD diamond to itself). The actor network has a single 
layer of $k$ actor units labeled $A_i$, $i = 1, \ldots, k$. The output of each actor unit is a component of a
$k$-dimensional action vector. An alternative is that there are $k$ separate actions, one commanded by
each actor unit, that compete with one another to be executed, but here we will think of the entire 
A-vector as an action. 

Figure 15.6: Actor-critic artificial neural network and a hypothetical neural implementation. a) Actor-critic
algorithm as an artificial neural network. The actor adjusts a policy based on the TD error $\delta$ it receives from the
critic; the critic adjusts state-value parameters using the same $\delta$. The critic produces a TD error from the reward
signal, $R$, and the current change in its estimate of state values. The actor does not have direct access to the
reward signal, and the critic does not have direct access to the action. b) Hypothetical neural implementation
of an actor-critic algorithm. The actor and the value-learning part of the critic are respectively placed in the
dorsal and ventral subdivisions of the striatum. The TD error is transmitted by dopamine neurons located
in the VTA and SNpc to modulate changes in synaptic efficacies of input from cortical areas to the ventral
and dorsal striatum. Adapted from Frontiers in Neuroscience, vol. 2(1), 2008, Y. Takahashi, G. Schoenbaum,
and Y. Niv, Silencing the critics: Understanding the effects of cocaine sensitization on dorsolateral and ventral
striatum in the context of an Actor/Critic model.


Both the critic and actor networks receive input consisting of multiple features representing the state 
of the agent’s environment. (Recall from Chapter 1 that the environment of a reinforcement learning 
agent includes components both inside and outside of the ‘organism’ containing the agent.) The figure 
shows these features as the circles labeled $x_1, x_2, \ldots, x_n$, shown twice just to keep the figure simple. A
weight representing the efficacy of a synapse is associated with each connection from each feature $x_i$ to
the critic unit, V, and to each of the action units, $A_i$. The weights in the critic network parameterize
the value function, and the weights in the actor network parameterize the policy. The networks learn as 
these weights change according to the critic and actor learning rules that we describe in the following 
section. 

The TD error produced by circuitry in the critic is the reinforcement signal for changing the weights 
in both the critic and the actor networks. This is shown in Figure 15.6a by the line labeled ‘TD error
$\delta$’ extending across all of the connections in the critic and actor networks. This aspect of the network
implementation, together with the reward prediction error hypothesis and the fact that the activity of 
dopamine neurons is so widely distributed by the extensive axonal arbors of these neurons, suggests 
that an actor-critic network something like this may not be too farfetched as a hypothesis about how 
reward-related learning might happen in the brain. 

Figure 15.6b suggests—very schematically—how the artificial neural network on the figure’s left 
might map onto structures in the brain according to the hypothesis of Takahashi et al. (2008). The 
hypothesis puts the actor and the value-learning part of the critic respectively in the dorsal and ventral
subdivisions of the striatum, the input structure of the basal ganglia. Recall from Section 15.4 that 
the dorsal striatum is primarily implicated in influencing action selection, and the ventral striatum is 
thought to be critical for different aspects of reward processing, including the assignment of affective 
value to sensations. The cerebral cortex, along with other structures, sends input to the striatum 
conveying information about stimuli, internal states, and motor activity. 

In this hypothetical actor-critic brain implementation, the ventral striatum sends value information 
to the VTA and SNpc, where dopamine neurons in these nuclei combine it with information about 
reward to generate activity corresponding to TD errors (though exactly how dopaminergic neurons 
calculate these errors is not yet understood). The ‘TD error $\delta$’ line in Figure 15.6a becomes the line
labeled ‘Dopamine’ in Figure 15.6b, which represents the widely branching axons of dopamine neurons 
whose cell bodies are in the VTA and SNpc. Referring back to Figure 15.2, these axons make synaptic 
contact with the spines on the dendrites of medium spiny neurons, the main input/output neurons of 
both the dorsal and ventral divisions of the striatum. Axons of the cortical neurons that send input 
to the striatum make synaptic contact on the tips of these spines. According to the hypothesis, it is 
at these spines where changes in the efficacies of the synapses from cortical regions to the striatum are
governed by learning rules that critically depend on a reinforcement signal supplied by dopamine. 

An important implication of the hypothesis illustrated in Figure 15.6b is that the dopamine signal is 
not the ‘master’ reward signal like the scalar $R_t$ of reinforcement learning. In fact, the hypothesis implies
that one should not necessarily be able to probe the brain and record any signal like $R_t$ in the activity
of any single neuron. Many interconnected neural systems generate reward-related information, with
different structures being recruited depending on different types of rewards. Dopamine neurons receive
information from many different brain areas, so the input to the SNpc and VTA labeled ‘Reward’ in
Figure 15.6b should be thought of as a vector of reward-related information arriving at neurons in these
nuclei along multiple input channels. What the theoretical scalar reward signal $R_t$ might correspond
to, then, is the net contribution of all reward-related information to dopamine neuron activity. It is the 
result of a pattern of activity across many neurons in different areas of the brain. 

Although the actor-critic neural implementation illustrated in Figure 15.6b may be correct on some 
counts, it clearly needs to be refined, extended, and modified to qualify as a full-fledged model of the 
function of the phasic activity of dopamine neurons. The Bibliographical and Historical Remarks section
at the end of this chapter cites publications that discuss in more detail both empirical support for this 
hypothesis and places where it falls short. We now look in detail at what the actor and critic learning 
algorithms suggest about the rules governing changes in synaptic efficacies of corticostriatal synapses. 


15.8 Actor and Critic Learning Rules 

If the brain does implement something like the actor-critic algorithm—and assuming populations of 
dopamine neurons broadcast a common reinforcement signal to the corticostriatal synapses of both 
the dorsal and ventral striatum as illustrated in Figure 15.6b (which is likely an oversimplification as 
we mentioned above)—then this reinforcement signal affects the synapses of these two structures in 
different ways. The learning rules for the critic and the actor use the same reinforcement signal, the 
TD error $\delta$, but its effect on learning is different for these two components. The TD error (combined
with eligibility traces) tells the actor how to update action probabilities in order to reach higher-valued
states. Learning by the actor is like instrumental conditioning using a Law-of-Effect-type learning rule
(Section 1.7): the actor works to keep $\delta$ as positive as possible. On the other hand, the TD error (when
combined with eligibility traces) tells the critic the direction and magnitude in which to change the
parameters of the value function in order to improve its predictive accuracy. The critic works to reduce
$\delta$’s magnitude to be as close to zero as possible using a learning rule like the TD model of classical
conditioning (Section 14.2). The difference between the critic and actor learning rules is relatively
simple, but this difference has a profound effect on learning and is essential to how the actor-critic
algorithm works. The difference lies solely in the eligibility traces each type of learning rule uses. 

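More than one set of learning rules can be used in actor-critic neural networks like those in Figure 15.6b
but, to be specific, here we focus on the actor-critic algorithm for continuing problems with
eligibility traces presented in Section 13.6. On each transition from state $S_t$ to state $S_{t+1}$, taking action
$A_t$ and receiving reward $R_{t+1}$, that algorithm computes the TD error ($\delta$) and then updates the eligibility
trace vectors ($\mathbf{z}^{\mathbf{w}}_t$ and $\mathbf{z}^{\boldsymbol{\theta}}_t$) and the parameters for the critic and actor ($\mathbf{w}$ and $\boldsymbol{\theta}$), according to

$$
\begin{aligned}
\delta_t &= R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}), \\
\mathbf{z}^{\mathbf{w}}_t &= \lambda^{\mathbf{w}} \mathbf{z}^{\mathbf{w}}_{t-1} + \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w}), \\
\mathbf{z}^{\boldsymbol{\theta}}_t &= \lambda^{\boldsymbol{\theta}} \mathbf{z}^{\boldsymbol{\theta}}_{t-1} + \nabla_{\boldsymbol{\theta}} \ln \pi(A_t \mid S_t, \boldsymbol{\theta}), \\
\mathbf{w} &\leftarrow \mathbf{w} + \alpha^{\mathbf{w}} \delta_t \mathbf{z}^{\mathbf{w}}_t, \\
\boldsymbol{\theta} &\leftarrow \boldsymbol{\theta} + \alpha^{\boldsymbol{\theta}} \delta_t \mathbf{z}^{\boldsymbol{\theta}}_t,
\end{aligned}
$$

where $\gamma \in [0, 1)$ is a discount-rate parameter, $\lambda^{\mathbf{w}} \in [0, 1]$ and $\lambda^{\boldsymbol{\theta}} \in [0, 1]$ are bootstrapping parameters
for the critic and the actor respectively, and $\alpha^{\mathbf{w}} > 0$ and $\alpha^{\boldsymbol{\theta}} > 0$ are analogous step-size parameters.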

Think of the approximate value function $\hat{v}$ as the output of a single linear neuron-like unit, called
the critic unit and labeled V in Figure 15.6a. Then the value function is a linear function of the
feature-vector representation of state $s$, $\mathbf{x}(s) = (x_1(s), \ldots, x_n(s))^\top$, parameterized by a weight vector
$\mathbf{w} = (w_1, \ldots, w_n)^\top$:

$$\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s). \tag{15.1}$$

Each $x_i(s)$ is like the presynaptic signal to a neuron’s synapse whose efficacy is $w_i$. The weights of
the critic are incremented according to the rule above by $\alpha^{\mathbf{w}} \delta_t \mathbf{z}^{\mathbf{w}}_t$, where the reinforcement signal, $\delta_t$,
corresponds to a dopamine signal being broadcast to all of the critic unit’s synapses. The eligibility
trace vector, $\mathbf{z}^{\mathbf{w}}_t$, for the critic unit is a trace (average of recent values) of $\nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})$. Because $\hat{v}(s, \mathbf{w})$
is linear in the weights, $\nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w}) = \mathbf{x}(S_t)$.

In neural terms, this means that each synapse has its own eligibility trace, which is one component 
of the vector $\mathbf{z}^{\mathbf{w}}_t$. A synapse’s eligibility trace accumulates according to the level of activity arriving at
that synapse, that is, the level of presynaptic activity, represented here by the component of the feature
vector $\mathbf{x}(S_t)$ arriving at that synapse. The trace otherwise decays toward zero at a rate governed
by the fraction $\lambda^{\mathbf{w}}$. A synapse is eligible for modification as long as its eligibility trace is non-zero.
How the synapse’s efficacy is actually modified depends on the reinforcement signals that arrive while 
the synapse is eligible. We call eligibility traces like these of the critic unit’s synapses non-contingent 
eligibility traces because they only depend on presynaptic activity and are not contingent in any way 
on postsynaptic activity. 
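
The critic unit's update can be sketched in code. This is a minimal illustration, assuming the linear value function (15.1) and the trace and weight updates given above. The function name `critic_step` and the default values of `gamma`, `lam`, and `alpha` are illustrative choices, not taken from the text. Note that the trace accumulates only the presynaptic feature vector, so nothing about the unit's own output enters it.

```python
def critic_step(w, z, x_t, x_next, reward, gamma=0.9, lam=0.8, alpha=0.1):
    """One update of a linear critic unit; returns (w, z, delta).

    w      : weight vector (synaptic efficacies)
    z      : non-contingent eligibility trace, one component per synapse
    x_t    : feature vector of the current state (presynaptic activity)
    x_next : feature vector of the next state
    """
    v_t = sum(wi * xi for wi, xi in zip(w, x_t))
    v_next = sum(wi * xi for wi, xi in zip(w, x_next))
    delta = reward + gamma * v_next - v_t               # TD error
    # For a linear unit the gradient of the value is the feature vector,
    # so the trace decays by lam and accumulates presynaptic activity only.
    z = [lam * zi + xi for zi, xi in zip(z, x_t)]
    w = [wi + alpha * delta * zi for wi, zi in zip(w, z)]
    return w, z, delta

w, z = [0.0, 0.0], [0.0, 0.0]
w, z, delta = critic_step(w, z, [1.0, 0.0], [0.0, 1.0], reward=1.0)
w, z, delta = critic_step(w, z, [0.0, 1.0], [0.0, 0.0], reward=2.0)
print(w, z, delta)
```

In the second call only the second feature is active, yet the first weight also changes: its synapse is still eligible because its trace has decayed to `lam` rather than to zero.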

The non-contingent eligibility traces of the critic unit’s synapses mean that the critic unit’s learning 
rule is essentially the TD model of classical conditioning described in Section 14.2. With the definition 
we have given above of the critic unit and its learning rule, the critic in Figure 15.6a is the same as 
the critic in the neural network actor-critic of Barto et al. (1983). Clearly, a critic like this consisting 
of just one linear neuron-like unit is the simplest starting point; this critic unit is a proxy for a more 
complicated neural network able to learn value functions of greater complexity. 

The actor in Figure 15.6a is a one-layer network of $k$ neuron-like actor units, each receiving at time
$t$ the same feature vector, $\mathbf{x}(S_t)$, that the critic unit receives. Each actor unit $j$, $j = 1, \ldots, k$, has its
own weight vector, $\boldsymbol{\theta}_j$, but since the actor units are all identical, we describe just one of the units and
omit the subscript. One way for these units to follow the actor-critic algorithm given in the equations
above is for each to be a Bernoulli-logistic unit. This means that the output of each actor unit at
each time is a random variable, $A_t$, taking value 0 or 1. Think of value 1 as the neuron firing, that
is, emitting an action potential. The weighted sum, $\boldsymbol{\theta}^\top \mathbf{x}(S_t)$, of a unit’s input vector determines the
unit’s action probabilities via the exponential softmax distribution (13.2), which for two actions is the
logistic function:


$$\pi(1 \mid s, \boldsymbol{\theta}) = 1 - \pi(0 \mid s, \boldsymbol{\theta}) = \frac{1}{1 + \exp(-\boldsymbol{\theta}^\top \mathbf{x}(s))}. \tag{15.2}$$


The weights of each actor unit are incremented, as above, by $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha^{\boldsymbol{\theta}} \delta_t \mathbf{z}^{\boldsymbol{\theta}}_t$, where $\delta_t$ again
corresponds to the dopamine signal: the same reinforcement signal that is sent to all the critic unit’s
synapses. Figure 15.6a shows $\delta_t$ being broadcast to all the synapses of all the actor units (which makes
this actor network a team of reinforcement learning agents, something we discuss in Section 15.10
below). The actor eligibility trace vector $\mathbf{z}^{\boldsymbol{\theta}}_t$ is a trace (average of recent values) of $\nabla_{\boldsymbol{\theta}} \ln \pi(A_t \mid S_t, \boldsymbol{\theta})$.
To understand this eligibility trace refer to Exercise 13.3, which defines this kind of unit and asks you
to give a learning rule for it. That exercise asks you to express $\nabla_{\boldsymbol{\theta}} \ln \pi(a \mid s, \boldsymbol{\theta})$ in terms of $a$, $\mathbf{x}(s)$,
and $\pi(a \mid s, \boldsymbol{\theta})$ (for arbitrary state $s$ and action $a$) by calculating the gradient. For the action and state
actually occurring at time $t$, the answer is:

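$$\nabla_{\boldsymbol{\theta}} \ln \pi(A_t \mid S_t, \boldsymbol{\theta}) = \big(A_t - \pi(1 \mid S_t, \boldsymbol{\theta})\big)\, \mathbf{x}(S_t). \tag{15.3}$$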
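
For readers who want the intermediate steps, the gradient of the log-probability of a Bernoulli-logistic unit can be derived as follows. The shorthand $p$ for the firing probability $\pi(1 \mid s, \boldsymbol{\theta})$ of (15.2) is introduced here only for this derivation.

```latex
% Shorthand (introduced for this derivation only): p = \pi(1 | s, \theta).
\begin{align*}
p &= \frac{1}{1 + \exp(-\boldsymbol{\theta}^\top \mathbf{x}(s))},
  \qquad \nabla_{\boldsymbol{\theta}}\, p = p\,(1 - p)\,\mathbf{x}(s), \\
\ln \pi(a \mid s, \boldsymbol{\theta}) &= a \ln p + (1 - a) \ln (1 - p),
  \qquad a \in \{0, 1\}, \\
\nabla_{\boldsymbol{\theta}} \ln \pi(a \mid s, \boldsymbol{\theta})
  &= \left( \frac{a}{p} - \frac{1 - a}{1 - p} \right) p\,(1 - p)\,\mathbf{x}(s) \\
  &= \big( a\,(1 - p) - (1 - a)\,p \big)\,\mathbf{x}(s) \\
  &= (a - p)\,\mathbf{x}(s).
\end{align*}
```

The factor $a - p$ equals $1 - p > 0$ when the unit fires and $-p < 0$ when it does not, so the gradient points along $\mathbf{x}(s)$ in the first case and against it in the second.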


Unlike the non-contingent eligibility trace of a critic synapse that only accumulates the presynaptic
activity $\mathbf{x}(S_t)$, the eligibility trace of an actor unit’s synapse in addition depends on the activity of
the actor unit itself. We call this a contingent eligibility trace because it is contingent on this
postsynaptic activity. The eligibility trace at each synapse continually decays, but increments or decrements
depending on the activity of the presynaptic neuron and whether or not the postsynaptic neuron fires.
The factor $A_t - \pi(1 \mid S_t, \boldsymbol{\theta})$ in (15.3) is positive when $A_t = 1$ and negative otherwise. The
postsynaptic contingency in the eligibility traces of actor units is the only difference between the critic and
actor learning rules. By keeping information about what actions were taken in what states,
contingent eligibility traces allow credit for reward (positive $\delta$), or blame for punishment (negative $\delta$), to be
apportioned among the policy parameters (the efficacies of the actor units’ synapses) according to the
contributions these parameters made to the units’ outputs that could have influenced later values of $\delta$.
Contingent eligibility traces mark the synapses as to how they should be modified to alter the units’
future responses to favor positive values of $\delta$.
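
A single Bernoulli-logistic actor unit with a contingent eligibility trace can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `fire_probability`, `actor_act`, and `actor_learn`, and the default values of `lam` and `alpha`, are hypothetical. The trace increment uses the gradient of the log-probability of the action taken, which for this unit is (action minus firing probability) times the presynaptic feature vector.

```python
import math
import random

def fire_probability(theta, x):
    """Probability that the unit fires: logistic function of the weighted input."""
    s = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-s))

def actor_act(theta, x, rng):
    """Sample the unit's output (1 = fire, 0 = silent); also return the firing probability."""
    p = fire_probability(theta, x)
    a = 1 if rng.random() < p else 0
    return a, p

def actor_learn(theta, z, x, a, p, delta, lam=0.8, alpha=0.1):
    """Update the contingent trace and the weights once the TD error is available."""
    # The increment depends on presynaptic activity x AND on the unit's output a.
    z = [lam * zi + (a - p) * xi for zi, xi in zip(z, x)]
    theta = [ti + alpha * delta * zi for ti, zi in zip(theta, z)]
    return theta, z

rng = random.Random(0)
theta, z, x = [0.0, 0.0], [0.0, 0.0], [1.0, 2.0]
a, p = actor_act(theta, x, rng)
theta, z = actor_learn(theta, z, x, a, p, delta=1.0)   # delta would come from the critic
print(a, theta, z)
```

With a positive TD error, the weights move so that the output just produced becomes more likely for this input; with a negative TD error they move the other way. Splitting acting from learning reflects that the TD error for a transition is only available after the action has been taken.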

What do the critic and actor learning rules suggest about how efficacies of corticostriatal synapses 
change? Both learning rules are related to Donald Hebb’s classic proposal that whenever a presynaptic 
signal participates in activating the postsynaptic neuron, the synapse’s efficacy increases (Hebb, 1949). 
The critic and actor learning rules share with Hebb’s proposal the idea that changes in a synapse’s 
efficacy depend on the interaction of several factors. In the critic learning rule the interaction is between 
the reinforcement signal $\delta$ and eligibility traces that depend only on presynaptic signals. Neuroscientists
call this a two-factor learning rule because the interaction is between two signals or quantities. The
actor learning rule, on the other hand, is a three-factor learning rule because, in addition to depending
on $\delta$, its eligibility traces depend on both presynaptic and postsynaptic activity. Unlike Hebb’s proposal,
however, the relative timing of the factors is critical to how synaptic efficacies change, with eligibility 
traces intervening to allow the reinforcement signal to affect synapses that were active in the recent 
past. 

Some subtleties about signal timing for the actor and critic learning rules deserve closer attention. 
In defining the neuron-like actor and critic units, we ignored the small amount of time it takes synaptic 
input to effect the firing of a real neuron. When an action potential from the presynaptic neuron 
arrives at a synapse, neurotransmitter molecules are released that diffuse across the synaptic cleft 
to the postsynaptic neuron, where they bind to receptors on the postsynaptic neuron’s surface; this 
activates molecular machinery that causes the postsynaptic neuron to fire (or to inhibit its firing in 
the case of inhibitory synaptic input). This process can take several tens of milliseconds. According 
to (15.1) and (15.2), though, the input to a critic and actor unit instantaneously produces the unit’s 
output. Ignoring activation time like this is common in abstract models of Hebbian-style plasticity in
which synaptic efficacies change according to a simple product of simultaneous pre- and postsynaptic 
activity. More realistic models must take activation time into account. 

Activation time is especially important for a more realistic actor unit because it influences how 
contingent eligibility traces have to work in order to properly apportion credit for reinforcement to 
the appropriate synapses. The expression $(A_t - \pi(1 \mid S_t, \boldsymbol{\theta}))\,\mathbf{x}(S_t)$ defining contingent eligibility traces
for the actor unit’s learning rule given above includes the postsynaptic factor $(A_t - \pi(1 \mid S_t, \boldsymbol{\theta}))$ and
the presynaptic factor $\mathbf{x}(S_t)$. This works because by ignoring activation time, the presynaptic activity
$\mathbf{x}(S_t)$ participates in causing the postsynaptic activity appearing in $(A_t - \pi(1 \mid S_t, \boldsymbol{\theta}))$. To assign credit
for reinforcement correctly, the presynaptic factor defining the eligibility trace must be a cause of the 
postsynaptic factor that also defines the trace. Contingent eligibility traces for a more realistic actor 
unit would have to take activation time into account. (Activation time should not be confused with 
the time required for a neuron to receive a reinforcement signal influenced by that neuron’s activity. 
The function of eligibility traces is to span this time interval which is generally much longer than the 
activation time. We discuss this further in the following section.) 

There are hints from neuroscience for how this process might work in the brain. Neuroscientists have 
discovered a form of Hebbian plasticity called spike-timing-dependent plasticity (STDP) that lends 
plausibility to the existence of actor-like synaptic plasticity in the brain. STDP is a Hebbian-style plasticity, 
but changes in a synapse’s efficacy depend on the relative timing of presynaptic and postsynaptic action 
potentials. The dependence can take different forms, but in the one most studied, a synapse increases 
in strength if spikes incoming via that synapse arrive shortly before the postsynaptic neuron fires. If 
the timing relation is reversed, with a presynaptic spike arriving shortly after the postsynaptic neuron 
fires, then the strength of the synapse decreases. STDP is a type of Hebbian plasticity that takes the 
activation time of a neuron into account, which is one of the ingredients needed for actor-like learning. 

The discovery of STDP has led neuroscientists to investigate the possibility of a three-factor form of 
STDP in which neuromodulatory input must follow appropriately-timed pre- and postsynaptic spikes. 
This form of synaptic plasticity, called reward-modulated STDP, is much like the actor learning rule 
discussed here. Synaptic changes that would be produced by regular STDP only occur if there is 
neuromodulatory input within a time window after a presynaptic spike is closely followed by a postsynaptic 
spike. Evidence is accumulating that reward-modulated STDP occurs at the spines of medium spiny 
neurons of the dorsal striatum, with dopamine providing the neuromodulatory factor—the sites where 
actor learning takes place in the hypothetical neural implementation of an actor-critic algorithm 
illustrated in Figure 15.6b. Experiments have demonstrated reward-modulated STDP in which lasting 
changes in the efficacies of corticostriatal synapses occur only if a neuromodulatory pulse arrives within 
a time window that can last up to 10 seconds after a presynaptic spike is closely followed by a 
postsynaptic spike (Yagishita et al. 2014). Although the evidence is indirect, these experiments point to 
the existence of contingent eligibility traces having prolonged time courses. The molecular mechanisms 
producing these traces, as well as the much shorter traces that likely underlie STDP, are not yet 
understood, but research focusing on time-dependent and neuromodulator-dependent synaptic plasticity 
is continuing. 
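
The three-factor logic of reward-modulated STDP described above can be sketched as a small Python function. This is an all-or-none caricature under stated assumptions, not a biophysical model: the 20 ms pairing window and the step size are illustrative values chosen here, the 10 second eligibility window is the upper bound mentioned in the text, and the function name and parameters are hypothetical.

```python
def rstdp_weight_change(t_pre, t_post, t_modulator,
                        pairing_window=0.02, eligibility_window=10.0,
                        step=0.1):
    """Return the change in synaptic efficacy for one spike pairing.

    Times are in seconds; t_modulator is None if no neuromodulatory
    pulse arrives.
    """
    # Factors 1 and 2: a presynaptic spike closely followed by a
    # postsynaptic spike makes the synapse eligible.
    paired = 0.0 < (t_post - t_pre) <= pairing_window
    if not paired or t_modulator is None:
        return 0.0
    # Factor 3: the neuromodulatory pulse must arrive while the
    # synapse is still eligible.
    delay = t_modulator - t_post
    return step if 0.0 <= delay <= eligibility_window else 0.0
```

With these illustrative settings, a pairing followed by a pulse two seconds later changes the synapse, whereas the same pairing with no pulse, with a pulse that arrives too late, or with the spike order reversed leaves it unchanged.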

The neuron-like actor unit that we have described here, with its Law-of-Effect-style learning rule, 
appeared in somewhat simpler form in the actor-critic network of Barto et al. (1983). That network 
was inspired by the “hedonistic neuron” hypothesis proposed by physiologist A. H. Klopf (1972, 1982). 
Not all the details of Klopf’s hypothesis are consistent with what has been learned about synaptic 
plasticity, but the discovery of STDP and the growing evidence for a reward-modulated form of STDP 
suggest that Klopf’s ideas may not have been far off the mark. We discuss Klopf’s hedonistic neuron 
hypothesis next. 


15.9 Hedonistic Neurons 


In his hedonistic neuron hypothesis, Klopf (1972, 1982) conjectured that individual neurons seek to 
maximize the difference between synaptic input treated as rewarding and synaptic input treated as punishing 
by adjusting the efficacies of their synapses on the basis of rewarding or punishing consequences of their 
own action potentials. In other words, individual neurons can be trained with response-contingent 
reinforcement, just as an animal can be trained in an instrumental conditioning task. His hypothesis included 
the idea that rewards and punishments are conveyed to a neuron via the same synaptic input that 
excites or inhibits the neuron’s spike-generating activity. (Had Klopf known what we know today about 
neuromodulatory systems, he might have assigned the reinforcing role to neuromodulatory input, but 
he wanted to avoid any centralized source of training information.) Synaptically-local traces of past 
pre- and postsynaptic activity had the key function in Klopf’s hypothesis of making synapses eligible 
(the term he introduced) for modification by later reward or punishment. He conjectured that these 
traces are implemented by molecular mechanisms local to each synapse and therefore different from the 
electrical activity of both the pre- and the postsynaptic neurons. In the Bibliographical and Historical 
Remarks section of this chapter we bring attention to some similar proposals made by others. 

Klopf specifically conjectured that synaptic efficacies change in the following way. When a neuron 
fires an action potential, all of its synapses that were active in contributing to that action potential 
become eligible to undergo changes in their efficacies. If the action potential is followed within an 
appropriate time period by an increase of reward, the efficacies of all the eligible synapses increase. 
Symmetrically, if the action potential is followed within an appropriate time period by an increase of 
punishment, the efficacies of eligible synapses decrease. This is implemented by triggering an eligibility 
trace at a synapse upon a coincidence of presynaptic and postsynaptic activity (or more exactly, upon 
pairing of presynaptic activity with the postsynaptic activity that that presynaptic activity participates 
in causing)—what we call a contingent eligibility trace. This is essentially the three-factor learning rule 
of an actor unit described in the previous section. 

The shape and time course of an eligibility trace in Klopf’s theory reflect the durations of the many 
feedback loops in which the neuron is embedded, some of which lie entirely within the brain and body 
of the organism, while others extend out through the organism’s external environment as mediated 
by its motor and sensory systems. His idea was that the shape of a synaptic eligibility trace is like 
a histogram of the durations of the feedback loops in which the neuron is embedded. The peak of 
an eligibility trace would then occur at the duration of the most prevalent feedback loops in which 
that neuron participates. The eligibility traces used by algorithms described in this book are simplified 
versions of Klopf’s original idea, being exponentially (or geometrically) decreasing functions controlled 
by the parameters $\lambda$ and $\gamma$. This simplifies simulations as well as theory, but we regard these simple 
eligibility traces as placeholders for traces closer to Klopf’s original conception, which would have 
computational advantages in complex reinforcement learning systems by refining the credit-assignment 
process. 
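
To make the contrast between the two trace shapes concrete, here is a minimal Python sketch. It assumes discrete time steps, a simple trace that decays by the factor $\gamma\lambda$ per step, and a made-up list of feedback-loop durations; the function names and the numbers are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter


def exponential_trace(gamma, lam, length):
    """Trace value k steps after eligibility is triggered: (gamma * lam) ** k."""
    return [(gamma * lam) ** k for k in range(length)]


def histogram_trace(loop_durations, length):
    """Trace shaped like a normalized histogram of feedback-loop durations."""
    counts = Counter(loop_durations)
    total = len(loop_durations)
    return [counts.get(k, 0) / total for k in range(length)]


simple = exponential_trace(gamma=0.9, lam=0.5, length=4)

# Hypothetical feedback-loop durations, measured in time steps.
klopf_like = histogram_trace([2, 2, 3, 5], length=6)
peak_delay = max(range(len(klopf_like)), key=klopf_like.__getitem__)
```

The exponential trace is largest immediately and decays from there, whereas the histogram-shaped trace peaks at the most common loop duration (two steps in this made-up example).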

Klopf’s hedonistic neuron hypothesis is not as implausible as it may at first appear. A well-studied 
example of a single cell that seeks some stimuli and avoids others is the bacterium Escherichia coli. 
The movement of this single-cell organism is influenced by chemical stimuli in its environment, a behavior 
known as chemotaxis. It swims in its liquid environment by rotating hairlike structures called flagella 
attached to its surface. (Yes, it rotates them!) Molecules in the bacterium’s environment bind to 
receptors on its surface. Binding events modulate the frequency with which the bacterium reverses 
flagellar rotation. Each reversal causes the bacterium to tumble in place and then head off in a random 
new direction. A little chemical memory and computation causes the frequency of flagellar reversal 
to decrease when the bacterium swims toward higher concentrations of molecules it needs to survive 
(attractants) and increase when the bacterium swims toward higher concentrations of molecules that 
are harmful (repellants). The result is that the bacterium tends to persist in swimming up attractant 
gradients and tends to avoid swimming up repellant gradients. 

The chemotactic behavior just described is called klinokinesis. It is a kind of trial-and-error behavior, 
although it is unlikely that learning is involved: the bacterium needs a modicum of short-term memory 
to detect molecular concentration gradients, but it probably does not maintain long-term memories. 
Artificial intelligence pioneer Oliver Selfridge called this strategy “run and twiddle,” pointing out its 
utility as a basic adaptive strategy: “keep going in the same way if things are getting better, and 
otherwise move around” (Selfridge, 1978, 1984). Similarly, one might think of a neuron “swimming” 
(not literally of course) in a medium composed of the complex collection of feedback loops in which 
it is embedded, acting to obtain one type of input signal and to avoid others. Unlike the bacterium, 
however, the neuron’s synaptic strengths retain information about its past trial-and-error behavior. If 
this view of the behavior of a neuron (or just one type of neuron) is plausible, then the closed-loop 
nature of how the neuron interacts with its environment is important for understanding its behavior, 
where the neuron’s environment consists of the rest of the animal together with the environment with 
which the animal as a whole interacts. 
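
The run-and-twiddle strategy is simple enough to sketch in a few lines of Python. The version below is a one-dimensional toy under stated assumptions: the walker moves one unit per step, remembers only the previous concentration, and tumbles to a random direction with low probability when things are getting better and with high probability otherwise. The concentration function, the tumble probabilities, and all names are hypothetical choices made for illustration.

```python
import random


def run_and_twiddle(concentration, start, steps, rng,
                    p_tumble_better=0.1, p_tumble_worse=0.9):
    """Return the final position of a 1-D run-and-twiddle walker."""
    x = start
    direction = rng.choice((-1, 1))
    previous = concentration(x)
    for _ in range(steps):
        x += direction
        current = concentration(x)
        # Short-term memory: compare with the concentration one step ago.
        improving = current > previous
        p_tumble = p_tumble_better if improving else p_tumble_worse
        if rng.random() < p_tumble:
            direction = rng.choice((-1, 1))  # tumble: pick a random new heading
        previous = current
    return x


# Hypothetical attractant that peaks at position 50.
final_position = run_and_twiddle(lambda x: -abs(x - 50), start=0,
                                 steps=2000, rng=random.Random(0))
```

Nothing in this sketch is learned: the walker ends up near the attractant peak only because it persists in good directions and abandons bad ones, which is the point of the contrast with a neuron whose synaptic strengths retain information about past behavior.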

Klopf’s hedonistic neuron hypothesis extended beyond the idea that individual neurons are 
reinforcement learning agents. He argued that many aspects of intelligent behavior can be understood as the 
result of the collective behavior of a population of self-interested hedonistic neurons interacting with one 
another in an immense society or economic system making up an animal’s nervous system. Whether or 
not this view of nervous systems is useful, the collective behavior of reinforcement learning agents has 
implications for neuroscience. We take up this subject next. 


15.10 Collective Reinforcement Learning 

The behavior of populations of reinforcement learning agents is deeply relevant to the study of social and 
economic systems, and if anything like Klopf’s hedonistic neuron hypothesis is correct, to neuroscience 
as well. The hypothesis described above about how an actor-critic algorithm might be implemented in 
the brain only narrowly addresses the implications of the fact that the dorsal and ventral subdivisions 
of the striatum, the respective locations of the actor and the critic according to the hypothesis, each 
contain millions of medium spiny neurons whose synapses undergo change modulated by phasic bursts 
of dopamine neuron activity. 

The actor in Figure 15.6a is a single-layer network of k actor units. The actions produced by this 
network are vectors $(A_1, A_2, \ldots, A_k)^\top$ presumed to drive the animal’s behavior. Changes in the efficacies 
of the synapses of all of these units depend on the reinforcement signal $\delta$. Because actor units attempt to 
make $\delta$ as large as possible, $\delta$ effectively acts as a reward signal for them (so in this case reinforcement is 
the same as reward). Thus, each actor unit is itself a reinforcement learning agent—a hedonistic neuron 
if you will. Now, to make the situation as simple as possible, assume that each of these units receives 
the same reward signal at the same time (although, as indicated above, the assumption that dopamine 
is released at all the corticostriatal synapses under the same conditions and at the same times is likely 
an oversimplification). 

What can reinforcement learning theory tell us about what happens when all members of a population 
of reinforcement learning agents learn according to a common reward signal? The field of multi-agent 
reinforcement learning considers many aspects of learning by populations of reinforcement learning 
agents. Although this field is beyond the scope of this book, we believe that some of its basic concepts 
and results are relevant to thinking about the brain’s diffuse neuromodulatory systems. In multi-agent 
reinforcement learning (and in game theory), the scenario in which all the agents try to maximize 
a common reward signal that they simultaneously receive is known as a cooperative game or a team 
problem. 

What makes a team problem interesting and challenging is that the common reward signal sent to 
each agent evaluates the pattern of activity produced by the entire population, that is, it evaluates the 
collective action of the team members. This means that any individual agent has only limited ability 
to affect the reward signal because any single agent contributes just one component of the collective 
action evaluated by the common reward signal. Effective learning in this scenario requires addressing a 
structural credit assignment problem: which team members, or groups of team members, deserve credit 
for a favorable reward signal, or blame for an unfavorable reward signal? It is a cooperative game, or 
a team problem, because the agents are united in seeking to increase the same reward signal: there 
are no conflicts of interest among the agents. The scenario would be a competitive game if different 
agents receive different reward signals, where each reward signal again evaluates the collective action of 
the population, and the objective of each agent is to increase its own reward signal. In this case there 
might be conflicts of interest among the agents, meaning that actions that are good for some agents are 
bad for others. Even deciding what the best collective action should be is a non-trivial aspect of game 
theory. This competitive setting might be relevant to neuroscience too (for example, to account for 
heterogeneity of dopamine neuron activity), but here we focus only on the cooperative, or team, case. 

How can each reinforcement learning agent in a team learn to “do the right thing” so that the collective 
action of the team is highly rewarded? An interesting result is that if each agent can learn effectively 
despite its reward signal being corrupted by a large amount of noise, and despite its lack of access to 
complete state information, then the population as a whole will learn to produce collective actions that 
improve as evaluated by the common reward signal, even when the agents cannot communicate with 
one another. Each agent faces its own reinforcement learning task in which its influence on the reward 
signal is deeply buried in the noise created by the influences of other agents. In fact, for any agent, all 
the other agents are part of its environment because its input, both the part conveying state information 
and the reward part, depends on how all the other agents are behaving. Furthermore, lacking access 
to the actions of the other agents, indeed lacking access to the parameters determining their policies, 
each agent can only partially observe the state of its environment. This makes each team member’s 
learning task very difficult, but if each uses a reinforcement learning algorithm able to increase a reward 
signal even under these difficult conditions, teams of reinforcement learning agents can learn to produce 
collective actions that improve over time as evaluated by the team’s common reward signal. 

If the team members are neuron-like units, then each unit has to have the goal of increasing the 
amount of reward it receives over time, as does the actor unit described in Section 15.8. Each 
unit’s learning algorithm has to have two essential features. First, it has to use contingent eligibility 
traces. Recall that a contingent eligibility trace, in neural terms, is initiated (or increased) at a synapse 
when its presynaptic input participates in causing the postsynaptic neuron to fire. A non-contingent 
eligibility trace, in contrast, is initiated or increased by presynaptic input independently of what the 
postsynaptic neuron does. As explained in Section 15.8, by keeping information about what actions were 
taken in what states, contingent eligibility traces allow credit for reward, or blame for punishment, to be 
apportioned to an agent’s policy parameters according to the contribution the values of these parameters 
made in determining the agent’s action. By similar reasoning, a team member must remember its recent 
action so that it can either increase or decrease the likelihood of producing that action according to 
the reward signal that is subsequently received. The action component of a contingent eligibility trace 
implements this action memory. Because of the complexity of the learning task, however, contingent 
eligibility is an essential but only preliminary step in the credit assignment process: the relationship between a 
single team member’s action and changes in the team’s reward signal is a statistical correlation that 
has to be estimated over many trials. 

Learning with non-contingent eligibility traces does not work at all in the team setting because 
it does not provide a way to correlate actions with consequent changes in the reward signal. 
Non-contingent eligibility traces are adequate for learning to predict, as the critic component of the 
actor-critic algorithm does, but they do not support learning to control, as the actor component must do. 
The members of a population of critic-like agents may still receive a common reinforcement signal, but 
they would all learn to predict the same quantity (which in the case of an actor-critic method, would 
be the expected return for the current policy). How successful each member of the population would 
be in learning to predict the expected return would depend on the information it receives, which could 
be very different for different members of the population. There would be no need for the population 
to produce differentiated patterns of activity. This is not a team problem as defined here. 

A second requirement for collective learning in a team problem is that there has to be variability in 
the actions of the team members in order for the team to explore the space of collective actions. The 
simplest way for a team of reinforcement learning agents to do this is for each member to independently 
explore its own action space through persistent variability in its output. This will cause the team as a 
whole to vary its collective actions. For example, a team of the actor units described in Section 15.8 
explores the space of collective actions because the output of each unit, being a Bernoulli-logistic 
unit, probabilistically depends on the weighted sum of its input vector’s components. The weighted 
sum biases firing probability up or down, but there is always variability. Because each unit uses a 
REINFORCE policy gradient algorithm (Chapter 13), each unit adjusts its weights with the goal of 
maximizing the average reward rate it experiences while stochastically exploring its own action space. 
One can show, as Williams (1992) did, that a team of Bernoulli-logistic REINFORCE units implements 
a policy gradient algorithm as a whole with respect to the average rate of the team’s common reward signal, 
where the actions are the collective actions of the team. 
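
A toy version of such a team can be simulated in a few lines of Python. The sketch below is only an illustration under stated assumptions: each unit has a single weight on a constant input of 1, the common reward is the fraction of units whose action matches a hidden target pattern, and no baseline or eligibility decay is used. The reward function, the step size, and all names are hypothetical and are not taken from Williams (1992).

```python
import math
import random


def train_team(target, steps, step_size, rng):
    """Train one Bernoulli-logistic REINFORCE unit per entry of `target`.

    Every unit receives the same reward: the fraction of units whose
    action matches the hidden target pattern. Returns the reward history.
    """
    weights = [0.0] * len(target)  # one weight per unit; input fixed at 1
    rewards = []
    for _ in range(steps):
        probs = [1.0 / (1.0 + math.exp(-w)) for w in weights]
        actions = [1 if rng.random() < p else 0 for p in probs]
        reward = sum(a == t for a, t in zip(actions, target)) / len(target)
        # Each unit uses only its own action, its own firing probability,
        # and the broadcast reward; no unit sees the others' actions.
        weights = [w + step_size * reward * (a - p)
                   for w, a, p in zip(weights, actions, probs)]
        rewards.append(reward)
    return rewards


rewards = train_team(target=[1, 0, 1, 1, 0], steps=20000, step_size=0.1,
                     rng=random.Random(0))
early = sum(rewards[:200]) / 200
late = sum(rewards[-1000:]) / 1000
```

From the point of view of any one unit, the other units' random actions are noise in the reward signal, yet in this sketch the average reward late in training should be clearly higher than at the start, because each unit's expected weight change points toward the action that raises the common reward.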

Further, Williams (1992) showed that a team of Bernoulli-logistic units using REINFORCE ascends 
the average reward gradient when the units in the team are interconnected to form a multilayer neural 
network. In this case, the reward signal is broadcast to all the units in the network, though reward 
may depend only on the collective actions of the network’s output units. This means that a multilayer 
team of Bernoulli-logistic REINFORCE units learns like a multilayer network trained by the 
widely-used error backpropagation method, but in this case the backpropagation process is replaced by the 
broadcast reward signal. In practice, the error backpropagation method is considerably faster, but 
the reinforcement learning team method is more plausible as a neural mechanism, especially in light of 
what is being learned about reward-modulated STDP as discussed in Section 15.8. 

Independent exploration by team members is only the simplest way for a team to 
explore; more sophisticated methods are possible if the team members communicate with one another so 
that they can coordinate their actions to focus on particular parts of the collective action space. There 
are also mechanisms more sophisticated than contingent eligibility traces for addressing structural credit 
assignment, which is easier in a team problem when the set of possible collective actions is restricted in 
some way. An extreme case is a winner-take-all arrangement (for example, the result of lateral inhibition 
in the brain) that restricts collective actions to those to which only one, or a few, team members 
contribute. In this case the winners get the credit or blame for resulting reward or punishment. 

Details of learning in cooperative games (or team problems) and non-cooperative game problems 
are beyond the scope of this book. The Bibliographical and Historical Remarks section at the end of 
this chapter cites a selection of the relevant publications, including extensive references to research on 
implications for neuroscience of collective reinforcement learning. 


15.11 Model-based Methods in the Brain 

Reinforcement learning’s distinction between model-free and model-based algorithms is proving to be 
useful for thinking about animal learning and decision processes. Section 14.6 discusses how this 
distinction aligns with that between habitual and goal-directed animal behavior. The hypothesis discussed 
above about how the brain might implement an actor-critic algorithm is relevant only to an animal’s 
habitual mode of behavior because the basic actor-critic method is model-free. What neural mechanisms 
are responsible for producing goal-directed behavior, and how do they interact with those underlying 
habitual behavior? 

One way to investigate questions about the brain structures involved in these modes of behavior is 
to inactivate an area of a rat’s brain and then observe what the rat does in an outcome-devaluation 
experiment (Section 14.6). Results from experiments like these indicate that the actor-critic hypothesis 
described above is too simple in placing the actor in the dorsal striatum. Inactivating one part of the 
dorsal striatum, the dorsolateral striatum (DLS), impairs habit learning, causing the animal to rely more 
on goal-directed processes. On the other hand, inactivating the dorsomedial striatum (DMS) impairs 
goal-directed processes, requiring the animal to rely more on habit learning. Results like these support 
the view that the DLS in rodents is more involved in model-free processes, whereas their DMS is more 
involved in model-based processes. Results of studies with human subjects in similar experiments using 
functional neuroimaging, and with non-human primates, support the view that the analogous structures 
in the primate brain are differentially involved in habitual and goal-directed modes of behavior. 

Other studies identify activity associated with model-based processes in the prefrontal cortex of 
the human brain, the front-most part of the frontal cortex implicated in executive function, including 
planning and decision making. Specifically implicated is the orbitofrontal cortex (OFC), the part of the 
prefrontal cortex immediately above the eyes. Functional neuroimaging in humans, and also recordings 
of the activities of single neurons in monkeys, reveal strong activity in the OFC related to the subjective 
reward value of biologically significant stimuli, as well as activity related to the reward expected as a 
consequence of actions. Although not free of controversy, these results suggest significant involvement 
of the OFC in goal-directed choice. It may be critical for the reward part of an animal’s environment 
model. 

Another structure involved in model-based behavior is the hippocampus, a structure critical for 
memory and spatial navigation. A rat’s hippocampus plays a critical role in the rat’s ability to navigate 
a maze in the goal-directed manner that led Tolman to the idea that animals use models, or cognitive 
maps, in selecting actions (Section 14.5). The hippocampus may also be a critical component of our 
human ability to imagine new experiences (Hassabis and Maguire, 2007; Ólafsdóttir, Barry, Saleem, 
Hassabis, and Spiers, 2015). 

The findings that most directly implicate the hippocampus in planning—the process needed to enlist 
an environment model in making decisions—come from experiments that decode the activity of neurons 
in the hippocampus to determine what part of space hippocampal activity is representing on a 
moment-to-moment basis. When a rat pauses at a choice point in a maze, the representation of space in the 
hippocampus sweeps forward (and not backwards) along the possible paths the animal can take from 
that point (Johnson and Redish, 2007). Furthermore, the spatial trajectories represented by these 
sweeps closely correspond to the rat’s subsequent navigational behavior (Pfeiffer and Foster, 2013). 
These results suggest that the hippocampus is critical for the state-transition part of an animal’s 
environment model, and that it is part of a system that uses the model to simulate possible future state 
sequences to assess the consequences of possible courses of action: a form of planning. 

The results described above add to a voluminous literature on neural mechanisms underlying 
goal-directed, or model-based, learning and decision making, but many questions remain unanswered. For 
example, how can areas as structurally similar as the DLS and DMS be essential components of modes 
of learning and behavior that are as different as model-free and model-based algorithms? Are separate 
structures responsible for (what we call) the transition and reward components of an environment 
model? Is all planning conducted at decision time via simulations of possible future courses of action as 
the forward sweeping activity in the hippocampus suggests? In other words, is all planning something 
like a rollout algorithm (Section 8.10)? Or are models sometimes engaged in the background to refine or 
recompute value information as illustrated by the Dyna architecture (Section 8.2)? How does the brain 
arbitrate between the use of the habit and goal-directed systems? Is there, in fact, a clear separation 
between the neural substrates of these systems? 

The evidence is not pointing to a positive answer to this last question. Summarizing the situation, 
Doll, Simon, and Daw (2012) wrote that “model-based influences appear ubiquitous more or less 
wherever the brain processes reward information,” and this is true even in the regions thought to be critical 
for model-free learning. This includes the dopamine signals themselves, which can exhibit the 
influence of model-based information in addition to the reward prediction errors thought to be the basis of 
model-free processes. 

Continuing neuroscience research informed by reinforcement learning’s model-free and model-based 
distinction has the potential to sharpen our understanding of habitual and goal-directed processes in 
the brain. A better grasp of these neural mechanisms may lead to algorithms combining model-free and 
model-based methods in ways that have not yet been explored in computational reinforcement learning. 


15.12 Addiction 


Understanding the neural basis of drug abuse is a high-priority goal of neuroscience with the potential 
to produce new treatments for this serious public health problem. One view is that drug craving is the 
result of the same motivation and learning processes that lead us to seek natural rewarding experiences 
that serve our biological needs. Addictive substances, by being intensely reinforcing, effectively co-opt 
our natural mechanisms of learning and decision making. This is plausible given that many, though not 
all, drugs of abuse increase levels of dopamine either directly or indirectly in regions around terminals 
of dopamine neuron axons in the striatum, a brain structure firmly implicated in normal 
reward-based learning (Section 15.7). But the self-destructive behavior associated with drug addiction is not 
characteristic of normal learning. What is different about dopamine-mediated learning when the reward 
is the result of an addictive drug? Is addiction the result of normal learning in response to substances 
that were largely unavailable throughout our evolutionary history, so that evolution could not select 
against their damaging effects? Or do addictive substances somehow interfere with normal 
dopamine-mediated learning? 

The reward prediction error hypothesis of dopamine neuron activity and its connection to TD learning 
are the basis of a model due to Redish (2004) of some, but certainly not all, features of addiction. 
The model is based on the observation that administration of cocaine and some other addictive drugs 
produces a transient increase in dopamine. In the model, this dopamine surge is assumed to increase 
the TD error, $\delta$, in a way that cannot be cancelled out by changes in the value function. In other 
words, whereas $\delta$ is reduced to the degree that a normal reward is predicted by antecedent events 
(Section 15.6), the contribution to $\delta$ due to an addictive stimulus does not decrease as the reward signal 
becomes predicted: drug rewards cannot be “predicted away.” The model does this by preventing 
$\delta$ from ever becoming negative when the reward signal is due to an addictive drug, thus eliminating 
the error-correcting feature of TD learning for states associated with administration of the drug. The 
result is that the values of these states increase without bound, making actions leading to these states 
preferred above all others. 
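
The effect described above can be seen in a deliberately simplified Python sketch. It is one reading of the idea, not Redish's full model: a single state leads to a terminal reward of 1, and when a drug is involved a fixed surge is added to the TD error and also acts as a floor on it, so the error can never be driven to zero or below. The surge size, the step size, and the function names are assumptions made for illustration.

```python
def td_error(reward, value_next, value_current, gamma, drug_surge=0.0):
    """TD error with an optional drug-induced component that cannot be cancelled."""
    delta = reward + gamma * value_next - value_current
    if drug_surge > 0.0:
        # The drug's contribution acts as a floor: however well the reward
        # is predicted, delta never falls below drug_surge.
        delta = max(delta + drug_surge, drug_surge)
    return delta


def learn_value(drug_surge, steps=500, step_size=0.1, gamma=1.0):
    """Repeatedly visit one state that leads to a terminal reward of 1."""
    value = 0.0
    for _ in range(steps):
        delta = td_error(1.0, 0.0, value, gamma, drug_surge)
        value += step_size * delta
    return value


natural_value = learn_value(drug_surge=0.0)  # settles near the true value, 1
drug_value = learn_value(drug_surge=0.5)     # keeps growing with more visits
```

With an ordinary reward the value estimate settles at the true value and the TD error vanishes. With the drug surge every visit adds at least `step_size * drug_surge` to the value, so the estimate grows for as long as the state keeps being visited.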

Addictive behavior is much more complicated than this result from Redish’s model, but the model’s 
main idea may be a piece of the puzzle. Or the model might be misleading. Dopamine appears not 
to play a critical role in all forms of addiction, and not everyone is equally susceptible to developing 
addictive behavior. Moreover, the model does not include the changes in many circuits and brain regions 
that accompany chronic drug taking, for example, changes that lead to a drug’s diminishing effect with 
repeated use. It is also likely that addiction involves model-based processes. Still, Redish’s model 
illustrates how reinforcement learning theory can be enlisted in the effort to understand a major health 
problem. In a similar manner, reinforcement learning theory has been influential in the development 
of the new field of computational psychiatry, which aims to improve understanding of mental disorders 
through mathematical and computational methods. 


15.13 Summary 


The neural pathways involved in the brain’s reward system are complex and incompletely understood, 
but neuroscience research directed toward understanding these pathways and their roles in behavior 
is progressing rapidly. This research is revealing striking correspondences between the brain’s reward 
system and the theory of reinforcement learning as presented in this book. 

The reward prediction error hypothesis of dopamine neuron activity was proposed by scientists who 
recognized striking parallels between the behavior of TD errors and the activity of neurons that 
produce dopamine, a neurotransmitter essential in mammals for reward-related learning and behavior. 
Experiments conducted in the late 1980s and 1990s in the laboratory of neuroscientist Wolfram Schultz 
showed that dopamine neurons respond to rewarding events with substantial bursts of activity, called 
phasic responses, only if the animal does not expect those events, suggesting that dopamine neurons 
are signaling reward prediction errors instead of reward itself. Further, these experiments showed that 
as an animal learns to predict a rewarding event on the basis of preceding sensory cues, the phasic 
activity of dopamine neurons shifts to earlier predictive cues while decreasing to later predictive cues. 
This parallels the backing-up effect of the TD error as a reinforcement learning agent learns to predict 
reward. 
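
The shift of the TD error from the time of reward to the time of the earliest predictive cue can be
reproduced in a few lines. The following is a minimal sketch, not the simulation used in this chapter:
it assumes a tabular TD(0) learner with one state per time step of a trial, holds the values of pre-cue
steps at zero on the assumption that the cue arrives unpredictably, and uses hypothetical names
(`simulate_td_errors`, `peak_time`) and parameter values chosen only for illustration.

```python
def simulate_td_errors(n_trials=300, cue_t=5, reward_t=15, n_steps=20,
                       alpha=0.2, gamma=1.0):
    """Run repeated cue-then-reward trials with tabular TD(0).

    Returns one list of TD errors per trial; entry t is the error computed
    when the agent arrives at time step t.
    """
    # values[t] is the predicted return at time step t (values[n_steps] is terminal).
    values = [0.0] * (n_steps + 1)
    history = []
    for _ in range(n_trials):
        deltas = [0.0] * (n_steps + 1)
        for t in range(n_steps):
            # A unit reward is delivered on arrival at time step reward_t.
            reward = 1.0 if t + 1 == reward_t else 0.0
            delta = reward + gamma * values[t + 1] - values[t]
            deltas[t + 1] = delta
            # Assumption: steps before the cue cannot predict it, so only
            # steps from the cue onward update their values.
            if t >= cue_t:
                values[t] += alpha * delta
        history.append(deltas)
    return history


def peak_time(deltas):
    """Time step at which the TD error is largest within one trial."""
    return max(range(len(deltas)), key=lambda t: deltas[t])


history = simulate_td_errors()
print("peak on first trial:", peak_time(history[0]))
print("peak on last trial:", peak_time(history[-1]))
```

Under these assumptions the largest TD error on the first trial falls at the reward time step, whereas
after many trials it falls at the cue time step and the error at the reward time is close to zero.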

Other experimental results firmly establish that the phasic activity of dopamine neurons is a reinforcement
signal for learning that reaches multiple areas of the brain by means of profusely branching
axons of dopamine-producing neurons. These results are consistent with the distinction we make between
a reward signal, $R_t$, and a reinforcement signal, which is the TD error $\delta_t$ in most of the algorithms
we present. Phasic responses of dopamine neurons are reinforcement signals, not reward signals.

A prominent hypothesis is that the brain implements something like an actor-critic algorithm. Two 
structures in the brain (the dorsal and ventral subdivisions of the striatum), both of which play critical 
roles in reward-based learning, may function respectively like an actor and a critic. That the TD error 
is the reinforcement signal for both the actor and the critic fits well with the facts that dopamine neuron 
axons target both the dorsal and ventral subdivisions of the striatum; that dopamine appears to be 
critical for modulating synaptic plasticity in both structures; and that the effect on a target structure 
of a neuromodulator such as dopamine depends on properties of the target structure and not just on 
properties of the neuromodulator. 

The actor and the critic can be implemented by artificial neural networks consisting of neuron-like 
units having learning rules based on the policy-gradient actor-critic method described in Section 13.5. 
Each connection in these networks is like a synapse between neurons in the brain, and the learning 
rules correspond to rules governing how synaptic efficacies change as functions of the activities of the 
presynaptic and the postsynaptic neurons, together with neuromodulatory input corresponding to input 
from dopamine neurons. In this setting, each synapse has its own eligibility trace that records past 
activity involving that synapse. The only difference between the actor and critic learning rules is that 
they use different kinds of eligibility traces: the critic unit’s traces are non-contingent because they do 
not involve the critic unit’s output, whereas the actor unit’s traces are contingent because in addition 
to the actor unit’s input, they depend on the actor unit’s output. In the hypothetical implementation 
of an actor-critic system in the brain, these learning rules respectively correspond to rules governing 
plasticity of corticostriatal synapses that convey signals from the cortex to the principal neurons in the 
dorsal and ventral striatal subdivisions, synapses that also receive inputs from dopamine neurons. 
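
The difference between the two kinds of eligibility trace is easy to see in code. The sketch below is
an illustration under stated assumptions rather than the networks described in this chapter: it assumes
a linear critic unit, a single Bernoulli-logistic actor unit, feature vectors given as plain lists, and
a single hypothetical parameter `trace_decay` standing in for the trace decay constant. The function
name `actor_critic_step` and all default values are likewise hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def actor_critic_step(w, theta, z_w, z_theta, x, action, reward, x_next,
                      alpha_w=0.1, alpha_theta=0.1, gamma=0.9, trace_decay=0.8):
    """One update of a linear critic unit and a Bernoulli-logistic actor unit.

    x and x_next are the feature vectors of the current and next states;
    action is the actor unit's output (0 or 1). New lists are returned.
    """
    # TD error: the reinforcement signal shared by both units.
    delta = reward + gamma * dot(w, x_next) - dot(w, x)

    # Probability that the actor unit outputs 1 for input x.
    p_fire = 1.0 / (1.0 + math.exp(-dot(theta, x)))

    # Non-contingent trace: depends only on the presynaptic input x.
    z_w = [trace_decay * z + xi for z, xi in zip(z_w, x)]
    # Contingent trace: the input x is weighted by the unit's own output
    # relative to its firing probability.
    z_theta = [trace_decay * z + (action - p_fire) * xi
               for z, xi in zip(z_theta, x)]

    # Both units use the same form of update: step size * delta * trace.
    w = [wi + alpha_w * delta * z for wi, z in zip(w, z_w)]
    theta = [ti + alpha_theta * delta * z for ti, z in zip(theta, z_theta)]
    return w, theta, z_w, z_theta, delta
```

With zero initial weights and traces, a positive TD error strengthens the critic weight of the active
input regardless of what the actor did, while the actor weight of that input moves up if the unit
output 1 and down if it output 0.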

The learning rule of an actor unit in the actor-critic network closely corresponds to reward-modulated 
spike-timing-dependent plasticity. In spike-timing-dependent plasticity (STDP), the relative timing of 
pre- and postsynaptic activity determines the direction of synaptic change. In reward-modulated STDP, 
changes in synapses in addition depend on a neuromodulator, such as dopamine, arriving within a time 
window that can last up to 10 seconds after the conditions for STDP are met. Evidence accumulating 
that reward-modulated STDP occurs at corticostriatal synapses, where the actor’s learning takes place 
in the hypothetical neural implementation of an actor-critic system, adds to the plausibility of the
hypothesis that something like an actor-critic system exists in the brains of some animals. 

The idea of synaptic eligibility and basic features of the actor learning rule derive from Klopf’s 
hypothesis of the “hedonistic neuron” (Klopf, 1972, 1982). He conjectured that individual neurons
seek to obtain reward and to avoid punishment by adjusting the efficacies of their synapses on the 
basis of rewarding or punishing consequences of their action potentials. A neuron’s activity can affect 
its later input because the neuron is embedded in many feedback loops, some within the animal’s 
nervous system and body and others passing through the animal’s external environment. Klopf’s idea 
of eligibility is that synapses are temporarily marked as eligible for modification if they participated in 
the neuron’s firing (making this the contingent form of eligibility trace). A synapse’s efficacy is modified 
if a reinforcing signal arrives while the synapse is eligible. We alluded to the chemotactic behavior of 
a bacterium as an example of a single cell that directs its movements in order to seek some molecules 
and to avoid others. 

A conspicuous feature of the dopamine system is that fibers releasing dopamine project widely to 
multiple parts of the brain. Although it is likely that only some populations of dopamine neurons 
broadcast the same reinforcement signal, if this signal reaches the synapses of many neurons involved in 
actor-type learning, then the situation can be modeled as a team problem. In this type of problem, each 
agent in a collection of reinforcement learning agents receives the same reinforcement signal, where that 
signal depends on the activities of all members of the collection, or team. If each team member uses a 
sufficiently capable learning algorithm, the team can learn collectively to improve performance of the 
entire team as evaluated by the globally-broadcast reinforcement signal, even if the team members do not 
directly communicate with one another. This is consistent with the wide dispersion of dopamine signals 
in the brain and provides a neurally plausible alternative to the widely-used error-backpropagation 
method for training multilayer networks. 
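
A team problem of this kind can be sketched with a handful of stochastic units that never communicate
and see only a common scalar signal. The example below is an illustration under assumptions of our
choosing, not an algorithm taken from the literature cited in this chapter: each unit is a
Bernoulli-logistic unit with a single bias parameter, the team's reward is the fraction of units whose
output matches a hidden target bit, and the broadcast reinforcement signal is the reward minus a running
average. The name `train_team`, the targets, and the step sizes are hypothetical.

```python
import math
import random


def train_team(targets, n_steps=5000, alpha=0.5, seed=0):
    """Train independent units using only a globally broadcast signal.

    Returns each unit's final probability of emitting 1.
    """
    rng = random.Random(seed)
    thetas = [0.0] * len(targets)
    baseline = 0.5  # running estimate of the team's average reward
    for _ in range(n_steps):
        probs = [1.0 / (1.0 + math.exp(-th)) for th in thetas]
        actions = [1 if rng.random() < p else 0 for p in probs]
        # The reward depends on the actions of all team members.
        reward = sum(a == t for a, t in zip(actions, targets)) / len(targets)
        # Every unit receives the same reinforcement signal.
        signal = reward - baseline
        # Each unit updates from its own action and probability only.
        thetas = [th + alpha * signal * (a - p)
                  for th, a, p in zip(thetas, actions, probs)]
        baseline += 0.05 * (reward - baseline)
    return [1.0 / (1.0 + math.exp(-th)) for th in thetas]


targets = [1, 0, 1, 1, 0, 0, 1, 0]
probs = train_team(targets)
print([round(p, 2) for p in probs])
```

Although each unit's update is noisy because the signal also reflects the other units' actions, its
expected direction follows the unit's own contribution to the team reward, so the units come to favor
their target outputs.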

The distinction between model-free and model-based reinforcement learning is helping neuroscientists 
investigate the neural bases of habitual and goal-directed learning and decision making. Research so 
far points to there being some brain regions more involved in one type of process than the other, but
the picture remains unclear because model-free and model-based processes do not appear to be neatly 
separated in the brain. Many questions remain unanswered. Perhaps most intriguing is evidence that 
the hippocampus, a structure traditionally associated with spatial navigation and memory, appears 
to be involved in simulating possible future courses of action as part of an animal’s decision-making 
process. This suggests that it is part of a system that uses an environment model for planning. 

Reinforcement learning theory is also influencing thinking about neural processes underlying drug 
abuse. A model of some features of drug addiction is based on the reward prediction error hypothesis. It 
proposes that an addicting stimulant, such as cocaine, destabilizes TD learning to produce unbounded 
growth in the values of actions associated with drug intake. This is far from a complete model of 
addiction, but it illustrates how a computational perspective suggests theories that can be tested with 
further research. The new field of computational psychiatry similarly focuses on the use of computational 
models, some derived from reinforcement learning, to better understand mental disorders. 
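
The destabilizing effect described above can be illustrated with a single learned value. This is a
rough sketch and not the published model: it assumes one state followed by a terminal state, and it
assumes that the drug adds a term to the TD error that learned predictions cannot cancel, implemented
here by never letting the error fall below that term. The names `learn_value` and `drug_surge` and all
numerical values are hypothetical.

```python
def learn_value(n_trials, reward, drug_surge=0.0, alpha=0.1):
    """Learn the value of a state that leads to a terminal outcome.

    With drug_surge == 0 this is ordinary TD learning. With drug_surge > 0
    the TD error is kept at or above drug_surge (an assumption of this sketch).
    """
    value = 0.0
    for _ in range(n_trials):
        # TD error for a transition into a terminal state of value zero.
        delta = reward - value
        if drug_surge > 0.0:
            # The drug's contribution cannot be "predicted away".
            delta = max(delta + drug_surge, drug_surge)
        value += alpha * delta
    return value


print("natural reward:", learn_value(1000, reward=1.0))
print("drug, 1000 trials:", learn_value(1000, reward=1.0, drug_surge=0.5))
print("drug, 2000 trials:", learn_value(2000, reward=1.0, drug_surge=0.5))
```

With a natural reward the value settles at the reward magnitude and the TD error vanishes. With the
drug term the error never drops below `drug_surge`, so the value keeps growing by a fixed amount on
every trial.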

This chapter only touched the surface of how the neuroscience of reinforcement learning and the
development of reinforcement learning in computer science and engineering have influenced one another.
Most features of reinforcement learning algorithms owe their design to purely computational
considerations, but some have been influenced by hypotheses about neural learning mechanisms.
Remarkably, as experimental data has accumulated about the brain’s reward processes, many of the purely
computationally-motivated features of reinforcement learning algorithms are turning out to be consistent
with neuroscience data. Other features of computational reinforcement learning, such as eligibility
traces and the ability of teams of reinforcement learning agents to learn to act collectively under the
influence of a globally-broadcast reinforcement signal, may also turn out to parallel experimental data
as neuroscientists continue to unravel the neural basis of reward-based animal learning and behavior.
Bibliographical and Historical Remarks

The number of publications treating parallels between the neuroscience of learning and decision making 
and the approach to reinforcement learning presented in this book is enormous. We can cite only a 
small selection. Niv (2009), Dayan and Niv (2008), Glimcher (2011), Ludvig, Bellemare, and Pearson
(2011), and Shah (2012) are good places to start. 

Together with economics, evolutionary biology, and mathematical psychology, reinforcement learning 
theory is helping to formulate quantitative models of the neural mechanisms of choice in humans and 
non-human primates. With its focus on learning, this chapter only lightly touches upon the neuroscience 
of decision making. Glimcher (2003) introduced the field of “neuroeconomics,” in which reinforcement 
learning contributes to the study of the neural basis of decision making from an economics perspective.
See also Glimcher and Fehr (2013). The text on computational and mathematical modeling in
neuroscience by Dayan and Abbott (2001) includes reinforcement learning’s role in these approaches. 
Sterling and Laughlin (2015) examined the neural basis of learning in terms of general design principles 
that enable efficient adaptive behavior. 

15.1 There are many good expositions of basic neuroscience. Kandel, Schwartz, Jessell, Siegelbaum, 
and Hudspeth (2013) is an authoritative and very comprehensive source. 

15.2 Berridge and Kringelbach (2008) reviewed the neural basis of reward and pleasure, pointing out 
that reward processing has many dimensions and involves many neural systems. Space prevents 
discussion of the influential research of Berridge and Robinson (1998), who distinguish between 
the hedonic impact of a stimulus, which they call “liking,” and the motivational effect, which 
they call “wanting.” Hare, O’Doherty, Camerer, Schultz, and Rangel (2008) examined the 
neural basis of value-related signals from an economic perspective, distinguishing between goal 
values, decision values, and prediction errors. Decision value is goal value minus action cost. 
See also Rangel, Camerer, and Montague (2008), Rangel and Hare (2010), and Peters and 
Büchel (2010).

15.3 The reward prediction error hypothesis of dopamine neuron activity is most prominently discussed
by Schultz, Montague, and Dayan (1997). The hypothesis was first explicitly put forward
by Montague, Dayan, and Sejnowski (1996). As they stated the hypothesis, it referred to reward
prediction errors (RPEs) but not specifically to TD errors; however, their development
of the hypothesis made it clear that they were referring to TD errors. The earliest recognition
of the TD-error/dopamine connection of which we are aware is that of Montague, Dayan,
Nowlan, Pouget, and Sejnowski (1992), who proposed a TD-error-modulated Hebbian learning
rule motivated by results on dopamine signaling from Schultz’s group. The connection was
also pointed out in an abstract by Quartz, Dayan, Montague, and Sejnowski (1992). Montague
and Sejnowski (1994) emphasized the importance of prediction in the brain and outlined
how predictive Hebbian learning modulated by TD errors could be implemented via a diffuse
neuromodulatory system, such as the dopamine system. Friston, Tononi, Reeke, Sporns, and
Edelman (1994) presented a model of value-dependent learning in the brain in which synaptic
changes are mediated by a TD-like error provided by a global neuromodulatory signal (although
they did not single out dopamine). Montague, Dayan, Person, and Sejnowski (1995)
presented a model of honeybee foraging using the TD error. The model is based on research
by Hammer, Menzel, and colleagues (Hammer and Menzel, 1995; Hammer, 1997) showing that
the neuromodulator octopamine acts as a reinforcement signal in the honeybee. Montague et
al. (1995) pointed out that dopamine likely plays a similar role in the vertebrate brain. Barto
(1995) related the actor-critic architecture to basal-ganglionic circuits and discussed the
relationship between TD learning and the main results from Schultz’s group. Houk, Adams,
and Barto (1995) suggested how TD learning and the actor-critic architecture might map onto
the anatomy, physiology, and molecular mechanism of the basal ganglia. Doya and Sejnowski
(1998) extended their earlier paper on a model of birdsong learning (Doya and Sejnowski, 1994)
by including a TD-like error identified with dopamine to reinforce the selection of auditory input
to be memorized. O’Reilly and Frank (2006) and O’Reilly, Frank, Hazy, and Watz (2007)
argued that phasic dopamine signals are RPEs but not TD errors. In support of their theory
they cited results with variable interstimulus intervals that do not match predictions of a simple
TD model, as well as the observation that higher-order conditioning beyond second-order
conditioning is rarely observed, while TD learning is not so limited. Dayan and Niv (2008)
discussed “the good, the bad, and the ugly” of how reinforcement learning theory and the
reward prediction error hypothesis align with experimental data. Glimcher (2011) reviewed
the empirical findings that support the reward prediction error hypothesis and emphasized the
significance of the hypothesis for contemporary neuroscience.

15.4 Graybiel (2000) is a brief primer on the basal ganglia. The experiments mentioned that involve
optogenetic activation of dopamine neurons were conducted by Tsai, Zhang, Adamantidis,
Stuber, Bonci, de Lecea, and Deisseroth (2009), Steinberg, Keiflin, Boivin, Witten, Deisseroth,
and Janak (2013), and Claridge-Chang, Roorda, Vrontou, Sjulson, Li, Hirsh, and Miesenböck
(2009). Fiorillo, Yun, and Song (2013), Lammel, Lim, and Malenka (2014), and Saddoris,
Cacciapaglia, Wightman, and Carelli (2015) are among studies showing that the signaling
properties of dopamine neurons are specialized for different target regions. RPE-signaling
neurons may belong to one among multiple populations of dopamine neurons having different
targets and subserving different functions. Eshel, Tian, Bukwich, and Uchida (2016) found
homogeneity of reward prediction error responses of dopamine neurons in the lateral VTA during
classical conditioning in mice, though their results do not rule out response diversity across wider
areas. Gershman, Pesaran, and Daw (2009) studied reinforcement learning tasks that can be
decomposed into independent components with separate reward signals, finding evidence in
human neuroimaging data suggesting that the brain exploits this kind of structure.

15.5 Schultz’s 1998 survey article (Schultz, 1998) is a good entrée into the very extensive literature
on reward-predicting signaling of dopamine neurons. Berns, McClure, Pagnoni, and Montague
(2001), Breiter, Aharon, Kahneman, Dale, and Shizgal (2001), Pagnoni, Zink, Montague, and 
Berns (2002), and O’Doherty, Dayan, Friston, Critchley, and Dolan (2003) described functional 
brain imaging studies supporting the existence of signals like TD errors in the human brain. 

15.6 This section roughly follows Barto (1995) in explaining how TD errors mimic the main results 
from Schultz’s group on the phasic responses of dopamine neurons. 

15.7 This section is largely based on Takahashi, Schoenbaum, and Niv (2008) and Niv (2009). To 
the best of our knowledge, Barto (1995) and Houk, Adams, and Barto (1995) first speculated 
about possible implementations of actor-critic algorithms in the basal ganglia. On the basis
of functional magnetic resonance imaging of human subjects while engaged in instrumental 
conditioning, O’Doherty, Dayan, Schultz, Deichmann, Friston, and Dolan (2004) suggested 
that the actor and the critic are most likely located respectively in the dorsal and ventral 
striatum. Gershman, Moustafa, and Ludvig (2013) focused on how time is represented in 
reinforcement learning models of the basal ganglia, discussing evidence for, and implications 
of, various computational approaches to time representation. 

The hypothetical neural implementation of the actor-critic architecture described in this section
includes very little detail about known basal ganglia anatomy and physiology. In addition
to the more detailed hypothesis of Houk, Adams, and Barto (1995), a number of other hypotheses
include more specific connections to anatomy and physiology and are claimed to explain
additional data. These include hypotheses proposed by Suri and Schultz (1998, 1999), Brown,
Bullock, and Grossberg (1999), Contreras-Vidal and Schultz (1999), Suri, Bargas, and Arbib
(2001), O’Reilly and Frank (2006), and O’Reilly, Frank, Hazy, and Watz (2007). Joel, Niv, and
Ruppin (2002) critically evaluated the anatomical plausibility of several of these models and
presented an alternative intended to accommodate some neglected features of basal ganglionic
circuitry.

15.8 The actor learning rule discussed here is more complicated than the one in the early actor-critic
network of Barto et al. (1983). Actor-unit eligibility traces in that network were traces
of just $A_t \times \mathbf{x}(S_t)$ instead of the full $(A_t - \pi(A_t|S_t, \boldsymbol{\theta}))\mathbf{x}(S_t)$. That work did not benefit from
the policy-gradient theory presented in Chapter 13 or the contributions of Williams (1986,
1992), who showed how an artificial neural network of Bernoulli-logistic units could implement
a policy-gradient method.

Reynolds and Wickens (2002) proposed a three-factor rule for synaptic plasticity in the corticostriatal
pathway in which dopamine modulates changes in corticostriatal synaptic efficacy.
They discussed the experimental support for this kind of learning rule and its possible molecular
basis. The definitive demonstration of spike-timing-dependent plasticity (STDP) is attributed
to Markram, Lübke, Frotscher, and Sakmann (1997), with evidence from earlier experiments
by Levy and Steward (1983) and others that the relative timing of pre- and postsynaptic spikes
is critical for inducing changes in synaptic efficacy. Rao and Sejnowski (2001) suggested how
STDP could be the result of a TD-like mechanism at synapses with non-contingent eligibility
traces lasting about 10 milliseconds. Dayan (2002) commented that this would require an error
as in Sutton and Barto’s (1981) early model of classical conditioning and not a true TD error.
Representative publications from the extensive literature on reward-modulated STDP are
Wickens (1990), Reynolds and Wickens (2002), and Calabresi, Picconi, Tozzi, and Di Filippo
(2007). Pawlak and Kerr (2008) showed that dopamine is necessary to induce STDP at the
corticostriatal synapses of medium spiny neurons. See also Pawlak, Wickens, Kirkwood, and
Kerr (2010). Yagishita, Hayashi-Takagi, Ellis-Davies, Urakubo, Ishii, and Kasai (2014) found
that dopamine promotes spine enlargement of the medium spiny neurons of mice only during a
time window from 0.3 to 2 seconds after STDP stimulation. Izhikevich (2007) proposed and
explored the idea of using STDP timing conditions to trigger contingent eligibility traces.

15.9 Klopf’s hedonistic neuron hypothesis (Klopf, 1972, 1982) inspired our actor-critic algorithm
implemented as an artificial neural network with a single neuron-like unit, called the actor 
unit, implementing a Law-of-Effect-like learning rule (Barto, Sutton, and Anderson, 1983). 
Ideas related to Klopf’s synaptically-local eligibility have been proposed by others. Crow (1968) 
proposed that changes in the synapses of cortical neurons are sensitive to the consequences of 
neural activity. Emphasizing the need to address the time delay between neural activity and 
its consequences in a reward-modulated form of synaptic plasticity, he proposed a contingent 
form of eligibility, but associated with entire neurons instead of individual synapses. According 
to his hypothesis, a wave of neuronal activity 

leads to a short-term change in the cells involved in the wave such that they are 
picked out from a background of cells not so activated. ... such cells are rendered 
sensitive by the short-term change to a reward signal ... in such a way that if such a 
signal occurs before the end of the decay time of the change the synaptic connexions 
between the cells are made more effective. (Crow, 1968) 

Crow argued against previous proposals that reverberating neural circuits play this role by 
pointing out that the effect of a reward signal on such a circuit would “...establish the synaptic 
connexions leading to the reverberation (that is to say, those involved in activity at the time 
of the reward signal) and not those on the path which led to the adaptive motor output.” 
Crow further postulated that reward signals are delivered via a “distinct neural fiber system,”
presumably the one into which Olds and Milner (1954) tapped, that would transform synaptic 
connections “from a short into a long-term form.” 

In another farsighted hypothesis, Miller (1981) proposed a Law-of-Effect-like learning rule that 
includes synaptically-local contingent eligibility traces: 

... it is envisaged that in a particular sensory situation neurone B, by chance, fires a 
‘meaningful burst’ of activity, which is then translated into motor acts, which then 
change the situation. It must be supposed that the meaningful burst has an influence, 
at the neuronal level, on all of its own synapses which are active at the time ... thereby
making a preliminary selection of the synapses to be strengthened, though not yet 
actually strengthening them. ...The strengthening signal ... makes the final selection 
... and accomplishes the definitive change in the appropriate synapses. (Miller, 1981, p. 81)

Miller’s hypothesis also included a critic-like mechanism, which he called a “sensory analyzer 
unit,” that worked according to classical conditioning principles to provide reinforcement signals
to neurons so that they would learn to move from lower- to higher-valued states, thus
anticipating the use of the TD error as a reinforcement signal in the actor-critic architecture. 
Miller’s idea not only parallels Klopf’s (with the exception of its explicit invocation of a distinct 
“strengthening signal”), it also anticipated the general features of reward-modulated STDP. 

A related though different idea, which Seung (2003) called the “hedonistic synapse,” is that 
synapses individually adjust the probability that they release neurotransmitter in the manner 
of the Law of Effect: if reward follows release, the release probability increases, and decreases 
if reward follows failure to release. This is essentially the same as the learning scheme Minsky 
used in his 1954 Princeton Ph.D. dissertation (Minsky, 1954), where he called the synapse-like 
learning element a SNARC (Stochastic Neural-Analog Reinforcement Calculator). Contingent 
eligibility is involved in these ideas too, although it is contingent on the activity of an individual 
synapse instead of the postsynaptic neuron. 

Frey and Morris (1997) proposed the idea of a “synaptic tag” for the induction of long-lasting
strengthening of synaptic efficacy. Though not unlike Klopf’s eligibility, their tag was hypothesized
to consist of a temporary strengthening of a synapse that could be transformed into a
long-lasting strengthening by subsequent neuron activation. The model of O’Reilly and Frank
(2006) and O’Reilly, Frank, Hazy, and Watz (2007) uses working memory to bridge temporal
intervals instead of eligibility traces. Wickens and Kötter (1995) discuss possible mechanisms
for synaptic eligibility. He, Huertas, Hong, Tie, Hell, Shouval, and Kirkwood (2015) provide evidence
supporting the existence of contingent eligibility traces in synapses of cortical neurons
with time courses like those of the eligibility traces Klopf postulated.

The metaphor of a neuron using a learning rule related to bacterial chemotaxis was discussed 
by Barto (1989). Koshland’s extensive study of bacterial chemotaxis was in part motivated by 
similarities between features of bacteria and features of neurons (Koshland, 1980). See also Berg 
(1975). Shimansky (2009) proposed a synaptic learning rule somewhat similar to Seung’s mentioned
above in which each synapse individually acts like a chemotactic bacterium. In this case
a collection of synapses “swims” toward attractants in the high-dimensional space of synaptic 
weight values. Montague, Dayan, Person, and Sejnowski (1995) proposed a chemotactic-like 
model of the bee’s foraging behavior involving the neuromodulator octopamine. 

15.10 Research on the behavior of reinforcement learning agents in team and game problems has a long 
history roughly occurring in three phases. To the best of our knowledge, the first phase began 
with investigations by the Russian mathematician and physicist M. L. Tsetlin. A collection of 
his work was published as Tsetlin (1973) after his death in 1966. Our Sections 1.7 and 4.8 refer
to his study of learning automata in connection to bandit problems. The Tsetlin collection
also includes studies of learning automata in team and game problems, which led to later
work in this area using stochastic learning automata as described by Narendra and Thathachar
(1974), Viswanathan and Narendra (1974), Lakshmivarahan and Narendra (1982), Narendra
and Wheeler (1983), Narendra (1989), and Thathachar and Sastry (2002). Thathachar and
Sastry (2011) is a more recent comprehensive account. These studies were mostly restricted to
non-associative learning automata, meaning that they did not address associative, or contextual, 
bandit problems (Section 2.9). 

The second phase began with the extension of learning automata to the associative, or contextual,
case. Barto, Sutton, and Brouwer (1981) and Barto and Sutton (1981) experimented with
associative stochastic learning automata in single-layer artificial neural networks to which a
global reinforcement signal was broadcast. They called neuron-like elements implementing this
kind of learning associative search elements (ASEs). Barto and Anandan (1985) introduced a
more sophisticated associative reinforcement learning algorithm called the associative reward-penalty
($A_{R-P}$) algorithm. They proved a convergence result by combining theory of stochastic
learning automata with theory of pattern classification. Barto (1985, 1986) and Barto and
Jordan (1987) described results with teams of $A_{R-P}$ units connected into multi-layer neural
networks, showing that they could learn nonlinear functions, such as XOR and others, with a
globally-broadcast reinforcement signal. Barto (1985) extensively discussed this approach to
artificial neural networks and how this type of learning rule is related to others in the literature
at that time. Williams (1992) mathematically analyzed and broadened this class of learning
rules and related their use to the error backpropagation method for training multilayer artificial
neural networks. Williams (1988) described several ways that backpropagation and reinforcement
learning can be combined for training artificial neural networks. Williams (1992) showed
that a special case of the $A_{R-P}$ algorithm is a REINFORCE algorithm, although better results
were obtained with the general $A_{R-P}$ algorithm (Barto, 1985).

The third phase of interest in teams of reinforcement learning agents was influenced by increased
understanding of the role of dopamine as a widely broadcast neuromodulator and speculation
about the existence of reward-modulated STDP. Much more so than earlier research, this
research considers details of synaptic plasticity and other constraints from neuroscience. Publications
include the following (chronologically and alphabetically): Bartlett and Baxter (1999,
2000), Xie and Seung (2004), Baras and Meir (2007), Farries and Fairhall (2007), Florian (2007),
Izhikevich (2007), Pecevski, Maass, and Legenstein (2007), Legenstein, Pecevski, and Maass
(2008), Kolodziejski, Porr, and Wörgötter (2009), Urbanczik and Senn (2009), and Vasilaki,
Frémaux, Urbanczik, Senn, and Gerstner (2009). Nowé, Vrancx, and De Hauwere (2012) reviewed
more recent developments in the wider field of multi-agent reinforcement learning.


15.11 Yin and Knowlton (2006) reviewed findings from outcome-devaluation experiments with rodents 
supporting the view that habitual and goal-directed behavior (as psychologists use the phrase) 
are respectively most associated with processing in the dorsolateral striatum (DLS) and the 
dorsomedial striatum (DMS). Results of functional imaging experiments with human subjects 
in the outcome-devaluation setting by Valentin, Dickinson, and O’Doherty (2007) suggest that 
the orbitofrontal cortex (OFC) is an important component of goal-directed choice. Single unit 
recordings in monkeys by Padoa-Schioppa and Assad (2006) support the role of the OFC in 
encoding values guiding choice behavior. Rangel, Camerer, and Montague (2008) and Rangel 
and Hare (2010) reviewed findings from the perspective of neuroeconomics about how the 
brain makes goal-directed decisions. Pezzulo, van der Meer, Lansink, and Pennartz (2014) 
reviewed the neuroscience of internally generated sequences and presented a model of how 
these mechanisms might be components of model-based planning. Daw and Shohamy (2008) 
proposed that while dopamine signaling connects well to habitual, or model-free, behavior,
other processes are involved in goal-directed, or model-based, behavior. Data from experiments 
by Bromberg-Martin, Matsumoto, Hong, and Hikosaka (2010) indicate that dopamine signals 
contain information pertinent to both habitual and goal-directed behavior. Doll, Simon, and 
Daw (2012) argued that there may not be a clear separation in the brain between mechanisms
that subserve habitual and goal-directed learning and choice. 

15.12 Keiflin and Janak (2015) reviewed connections between TD errors and addiction. Nutt,
Lingford-Hughes, Erritzoe, and Stokes (2015) critically evaluated the hypothesis that addiction is due to
a disorder of the dopamine system. Montague, Dolan, Friston, and Dayan (2012) outlined the 
goals and early efforts in the field of computational psychiatry, and Adams, Huys, and Roiser 
(2015) reviewed more recent progress. 



Chapter 16 


Applications and Case Studies 


In this final chapter we present a few case studies of reinforcement learning. Several of these are 
substantial applications of potential economic significance. One, Samuel’s checkers player, is primarily 
of historical interest. Our presentations are intended to illustrate some of the trade-offs and issues that 
arise in real applications. For example, we emphasize how domain knowledge is incorporated into the 
formulation and solution of the problem. We also highlight the representation issues that are so often 
critical to successful applications. The algorithms used in some of these case studies are substantially 
more complex than those we have presented in the rest of the book. Applications of reinforcement 
learning are still far from routine and typically require as much art as science. Making applications 
easier and more straightforward is one of the goals of current research in reinforcement learning. 


16.1 TD-Gammon 

One of the most impressive applications of reinforcement learning to date is that by Gerald Tesauro 
to the game of backgammon (Tesauro, 1992, 1994, 1995, 2002). Tesauro’s program, TD-Gammon, 
required little backgammon knowledge, yet learned to play extremely well, near the level of the world’s 
strongest grandmasters. The learning algorithm in TD-Gammon was a straightforward combination of 
the TD($\lambda$) algorithm and nonlinear function approximation using a multilayer neural network trained 
by backpropagating TD errors. 

Backgammon is a major game in the sense that it is played throughout the world, with numerous 
tournaments and regular world championship matches. It is in part a game of chance, and it is a popular 
vehicle for wagering significant sums of money. There are probably more professional backgammon players 
than there are professional chess players. The game is played with 15 white and 15 black pieces on a 
board of 24 locations, called points. Figure 16.1 shows a typical position early in the game, seen from 
the perspective of the white player. 

In this figure, white has just rolled the dice and obtained a 5 and a 2. This means that he can move 
one of his pieces 5 steps and one (possibly the same piece) 2 steps. For example, he could move two 
pieces from the 12 point, one to the 17 point, and one to the 14 point. White’s objective is to advance 
all of his pieces into the last quadrant (points 19-24) and then off the board. The first player to remove 
all his pieces wins. One complication is that the pieces interact as they pass each other going in different 
directions. For example, if it were black’s move in Figure 16.1, he could use the dice roll of 2 to move a 
piece from the 24 point to the 22 point, “hitting” the white piece there. Pieces that have been hit are 
placed on the “bar” in the middle of the board (where we already see one previously hit black piece), 
from whence they reenter the race from the start. However, if there are two pieces on a point, then the 
opponent cannot move to that point; the pieces are protected from being hit. Thus, white cannot use 
his 5-2 dice roll to move either of his pieces on the 1 point, because their possible resulting points are 
occupied by groups of black pieces. Forming contiguous blocks of occupied points to block the opponent 
is one of the elementary strategies of the game. 

Backgammon involves several further complications, but the above description gives the basic idea. 
With 30 pieces and 24 possible locations (26, counting the bar and off-the-board) it should be clear 
that the number of possible backgammon positions is enormous, far more than the number of memory 
elements one could have in any physically realizable computer. The number of moves possible from each 
position is also large. For a typical dice roll there might be 20 different ways of playing. In considering 
future moves, such as the response of the opponent, one must consider the possible dice rolls as well. 
The result is that the game tree has an effective branching factor of about 400. This is far too large to 
permit effective use of the conventional heuristic search methods that have proved so effective in games 
like chess and checkers. 
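
The figure of about 400 can be checked with a rough count. The sketch below assumes that the two dice are treated as unordered (so that 5-2 and 2-5 are the same roll) and takes the estimate of about 20 plays per roll from the text; the variable names are ours.

```python
from itertools import combinations_with_replacement

# Distinct outcomes of two unordered six-sided dice (5-2 is the same roll as 2-5).
rolls = list(combinations_with_replacement(range(1, 7), 2))
num_rolls = len(rolls)

# Assumption taken from the text: roughly 20 legal plays for a typical roll.
plays_per_roll = 20

# Each ply of lookahead branches over the dice roll and then over the plays.
effective_branching = num_rolls * plays_per_roll
print(num_rolls, effective_branching)  # 21 420
```

This count ignores the unequal probabilities of doubles and non-doubles, so it only indicates the order of magnitude.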

On the other hand, the game is a good match to the capabilities of TD learning methods. Although 
the game is highly stochastic, a complete description of the game’s state is available at all times. The 
game evolves over a sequence of moves and positions until it finally ends in a win for one player or 
the other. The outcome can be interpreted as a final reward to be predicted. However, 
the theoretical results we have described so far cannot be usefully applied to this task. 
The number of states is so large that a lookup table cannot be used, and the opponent is a source of 
uncertainty and time variation. 

TD-Gammon used a nonlinear form of TD($\lambda$). The estimated value, $\hat{v}(s,\mathbf{w})$, of any state (board 
position) s was meant to estimate the probability of winning starting from state s. To achieve this, 
rewards were defined as zero for all time steps except those on which the game is won. To implement the 
value function, TD-Gammon used a standard multilayer neural network, much as shown in Figure 16.2. 
(The real network had two additional units in its final layer to estimate the probability of each player’s 
winning in a special way called a “gammon” or “backgammon.”) The network consisted of a layer 
of input units, a layer of hidden units, and a final output unit. The input to the network was a 
representation of a backgammon position, and the output was an estimate of the value of that position. 

In the first version of TD-Gammon, TD-Gammon 0.0, backgammon positions were represented to 
the network in a relatively direct way that involved little backgammon knowledge. It did, however, 
involve substantial knowledge of how neural networks work and how information is best presented to 
them. It is instructive to note the exact representation Tesauro chose. There were a total of 198 input 
units to the network. For each point on the backgammon board, four units indicated the number of 
white pieces on the point. If there were no white pieces, then all four units took on the value zero. If 
there was one piece, then the first unit took on the value 1. This encoded the elementary concept of a 
“blot,” i.e., a piece that can be hit by the opponent. If there were two or more pieces, then the second 
unit was set to 1. This encoded the basic concept of a “made point” on which the opponent cannot 
land. If there were exactly three pieces on the point, then the third unit was set to 1. This encoded 
the basic concept of a “single spare,” i.e., an extra piece in addition to the two pieces that made the 
point. Finally, if there were more than three pieces, the fourth unit was set to a value proportionate 
to the number of additional pieces beyond three. Letting n denote the total number of pieces on the 
point, if n > 3, then the fourth unit took on the value (n − 3)/2. This encoded a linear representation 
of “multiple spares” at the given point. 

With four units for white and four for black at each of the 24 points, that made a total of 192 units. 
Two additional units encoded the number of white and black pieces on the bar (each took the value 
n/2, where n is the number of pieces on the bar), and two more encoded the number of black and white 
pieces already successfully removed from the board (these took the value n/15, where n is the number 
of pieces already borne off). Finally, two units indicated in a binary fashion whether it was white’s or 
black’s turn to move. The general logic behind these choices should be clear. Basically, Tesauro tried 
to represent the position in a straightforward way, while keeping the number of units relatively small. 
He provided one unit for each conceptually distinct possibility that seemed likely to be relevant, and 
he scaled them to roughly the same range, in this case between 0 and 1. 
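
The encoding can be sketched in a few lines. This is our reading of the description, not Tesauro's code: we take "one piece" to mean exactly one, and the ordering of the units within the vector, the function names, and the example position are assumptions made for illustration.

```python
def encode_point(n):
    """Four units describing the n pieces one color has on a single point."""
    return [
        1.0 if n == 1 else 0.0,           # blot
        1.0 if n >= 2 else 0.0,           # made point
        1.0 if n == 3 else 0.0,           # single spare
        (n - 3) / 2.0 if n > 3 else 0.0,  # multiple spares
    ]

def encode_position(white_points, black_points, white_bar, black_bar,
                    white_off, black_off, white_to_move):
    """Build the 198-unit input vector; *_points are lists of 24 piece counts."""
    x = []
    for w, b in zip(white_points, black_points):
        x += encode_point(w) + encode_point(b)        # 24 points * 8 units = 192
    x += [white_bar / 2.0, black_bar / 2.0]           # pieces on the bar
    x += [white_off / 15.0, black_off / 15.0]         # pieces borne off
    x += [1.0, 0.0] if white_to_move else [0.0, 1.0]  # whose turn it is
    return x

# Illustrative position (index 0 is the 1 point).
white = [0] * 24
black = [0] * 24
white[0], white[11], white[16], white[18] = 2, 5, 3, 5
black[23], black[12], black[7], black[5] = 2, 5, 3, 5
x = encode_position(white, black, 0, 0, 0, 0, white_to_move=True)
print(len(x))  # 198
```

Note that the multiple-spares unit can exceed 1 when many pieces are stacked on one point, which is consistent with the units being scaled only roughly to the same range.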

Given a representation of a backgammon position, the network computed its estimated value in the 
standard way. Corresponding to each connection from an input unit to a hidden unit was a real-valued 
weight. Signals from each input unit were multiplied by their corresponding weights and summed at 
the hidden unit. The output, $h(j)$, of hidden unit $j$ was a nonlinear sigmoid function of the weighted 
sum: 

$$h(j) = \sigma\Big(\sum_i w_{ij} x_i\Big) = \frac{1}{1 + e^{-\sum_i w_{ij} x_i}},$$

where $x_i$ is the value of the $i$th input unit and $w_{ij}$ is the weight of its connection to the $j$th hidden unit 
(all the weights in the network together make up the parameter vector $\mathbf{w}$). The output of the sigmoid 
is always between 0 and 1, and has a natural interpretation as a probability based on a summation 
of evidence. The computation from hidden units to the output unit was entirely analogous. Each 
connection from a hidden unit to the output unit had a separate weight. The output unit formed the 
weighted sum and then passed it through the same sigmoid nonlinearity. 
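
A minimal sketch of this forward computation follows. It assumes no bias terms (none are mentioned above) and uses 40 hidden units with weights drawn uniformly from a small interval; the interval and all names are our choices for illustration.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, w_hidden, w_out):
    """w_hidden[j][i] connects input i to hidden unit j; w_out[j] connects
    hidden unit j to the single output unit."""
    h = [sigmoid(sum(w_ji * x_i for w_ji, x_i in zip(w_j, x))) for w_j in w_hidden]
    v = sigmoid(sum(w_j * h_j for w_j, h_j in zip(w_out, h)))
    return h, v

random.seed(0)
n_in, n_hidden = 198, 40
w_hidden = [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
w_out = [random.uniform(-0.1, 0.1) for _ in range(n_hidden)]

x = [0.0] * n_in                 # an all-zero input, used only to exercise the code
h, v = forward(x, w_hidden, w_out)
print(h[0])                      # 0.5, the sigmoid of a zero weighted sum
```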

TD-Gammon used the semi-gradient form of the TD($\lambda$) algorithm described in Section 12.2, with the 
gradients computed by the error backpropagation algorithm (Rumelhart, Hinton, and Williams, 1986). 


[Figure 16.2: The neural network used in TD-Gammon. Input: backgammon position (198 input units); hidden units (40-80); output: predicted probability of winning, $\hat{v}(S_t, \mathbf{w})$.]

Recall that the general update rule for this case is 

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \Big[ R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t) \Big] \mathbf{z}_t, \tag{16.1}$$


where $\mathbf{w}_t$ is the vector of all modifiable parameters (in this case, the weights of the network) and $\mathbf{z}_t$ is 
a vector of eligibility traces, one for each component of $\mathbf{w}_t$, updated by 

$$\mathbf{z}_t = \gamma\lambda\mathbf{z}_{t-1} + \nabla\hat{v}(S_t, \mathbf{w}_t),$$

with $\mathbf{z}_0 = \mathbf{0}$. The gradient in this equation can be computed efficiently by the backpropagation procedure. 
For the backgammon application, in which $\gamma = 1$ and the reward is always zero except upon 
winning, the TD error portion of the learning rule is usually just $\hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})$, as suggested in 
Figure 16.2. 
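
To make the update concrete, here is a small sketch of one step of (16.1) for a two-layer sigmoid network with the gradient computed by backpropagation. The tiny network, the step size, and the value of $\lambda$ are hypothetical choices for illustration, and the function names are ours.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def value_and_gradient(x, W, u):
    """Return v_hat(x) and its gradient with respect to the hidden weights
    W[j][i] and the output weights u[j]."""
    h = [sigmoid(sum(wji * xi for wji, xi in zip(Wj, x))) for Wj in W]
    v = sigmoid(sum(uj * hj for uj, hj in zip(u, h)))
    dv = v * (1.0 - v)                       # derivative of the output sigmoid
    grad_u = [dv * hj for hj in h]
    grad_W = [[dv * uj * hj * (1.0 - hj) * xi for xi in x] for uj, hj in zip(u, h)]
    return v, grad_W, grad_u

def td_lambda_step(x, target, W, u, zW, zu, alpha, lam, gamma=1.0):
    """One application of (16.1). `target` is R_{t+1} + gamma * v_hat(S_{t+1}, w_t),
    computed by the caller before the weights change."""
    v, gW, gu = value_and_gradient(x, W, u)
    for j in range(len(u)):                  # z_t = gamma*lambda*z_{t-1} + gradient
        zu[j] = gamma * lam * zu[j] + gu[j]
        for i in range(len(x)):
            zW[j][i] = gamma * lam * zW[j][i] + gW[j][i]
    delta = target - v                       # TD error
    for j in range(len(u)):                  # w_{t+1} = w_t + alpha*delta*z_t
        u[j] += alpha * delta * zu[j]
        for i in range(len(x)):
            W[j][i] += alpha * delta * zW[j][i]
    return delta

W = [[0.1, -0.2], [0.05, 0.3]]               # hidden weights (hypothetical values)
u = [0.2, -0.1]                              # output weights (hypothetical values)
zW = [[0.0, 0.0], [0.0, 0.0]]                # traces start at zero
zu = [0.0, 0.0]
x_t, x_next = [1.0, 0.0], [0.0, 1.0]         # two successive positions
v_before, _, _ = value_and_gradient(x_t, W, u)
v_next, _, _ = value_and_gradient(x_next, W, u)
delta = td_lambda_step(x_t, 0.0 + 1.0 * v_next, W, u, zW, zu, alpha=0.1, lam=0.7)
v_after, _, _ = value_and_gradient(x_t, W, u)
```

On the final move of a game the caller would pass the outcome itself (1 for a win, 0 otherwise) as `target`, since there is no successor position to evaluate.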

To apply the learning rule we need a source of backgammon games. Tesauro obtained an unending 
sequence of games by playing his learning backgammon player against itself. To choose its moves, 
TD-Gammon considered each of the 20 or so ways it could play its dice roll and the corresponding positions 
that would result. The resulting positions are afterstates as discussed in Section 6.8. The network was 
consulted to estimate each of their values. The move was then selected that would lead to the position 
with the highest estimated value. Continuing in this way, with TD-Gammon making the moves for both 
sides, it was possible to easily generate large numbers of backgammon games. Each game was treated 
as an episode, with the sequence of positions acting as the states, $S_0, S_1, S_2, \ldots$. Tesauro applied the 
nonlinear TD rule (16.1) fully incrementally, that is, after each individual move. 
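
Move selection by afterstate evaluation amounts to a one-line search. In the sketch below we assume, for illustration only, that the value function estimates the probability that white wins, so that black prefers the afterstate with the lowest estimate; how the two sides were handled in TD-Gammon itself is not specified above. The names are hypothetical.

```python
def select_move(afterstates, value, white_to_move=True):
    """Return the afterstate that looks best for the player who is moving.
    `value(s)` is assumed to estimate the probability that white wins from s."""
    pick = max if white_to_move else min
    return pick(afterstates, key=value)

# Toy example: three candidate afterstates with made-up value estimates.
estimates = {"a": 0.4, "b": 0.7, "c": 0.1}
white_choice = select_move(list(estimates), estimates.get, white_to_move=True)
black_choice = select_move(list(estimates), estimates.get, white_to_move=False)
print(white_choice, black_choice)  # b c
```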

The weights of the network were set initially to small random values. The initial evaluations were 
thus entirely arbitrary. Since the moves were selected on the basis of these evaluations, the initial moves 
were inevitably poor, and the initial games often lasted hundreds or thousands of moves before one side 
or the other won, almost by accident. After a few dozen games, however, performance improved rapidly. 

After playing about 300,000 games against itself, TD-Gammon 0.0 as described above learned to play 
approximately as well as the best previous backgammon computer programs. This was a striking result 
because all the previous high-performance computer programs had used extensive backgammon knowledge. 
For example, the reigning champion program at the time was, arguably, Neurogammon, another 
program written by Tesauro that used a neural network but not TD learning. Neurogammon’s network 
was trained on a large training corpus of exemplary moves provided by backgammon experts, and, in 
addition, started with a set of features specially crafted for backgammon. Neurogammon was a highly 
tuned, highly effective backgammon program that decisively won the World Backgammon Olympiad in 
1989. TD-Gammon 0.0, on the other hand, was constructed with essentially zero backgammon knowledge. 
That it was able to do as well as Neurogammon and all other approaches is striking testimony to 
the potential of self-play learning methods. 

The tournament success of TD-Gammon 0.0 with zero expert backgammon knowledge suggested an 
obvious modification: add the specialized backgammon features but keep the self-play TD learning 
method. This produced TD-Gammon 1.0. TD-Gammon 1.0 was clearly substantially better than all 
previous backgammon programs and found serious competition only among human experts. Later 
versions of the program, TD-Gammon 2.0 (40 hidden units) and TD-Gammon 2.1 (80 hidden units), 
were augmented with a selective two-ply search procedure. To select moves, these programs looked 
ahead not just to the positions that would immediately result, but also to the opponent’s possible dice 
rolls and moves. Assuming the opponent always took the move that appeared immediately best for him, 
the expected value of each candidate move was computed and the best was selected. To save computer 
time, the second ply of search was conducted only for candidate moves that were ranked highly after 
the first ply, about four or five moves on average. Two-ply search affected only the moves selected; the 
learning process proceeded exactly as before. The final versions of the program, TD-Gammon 3.0 and 
3.1, used 160 hidden units and a selective three-ply search. TD-Gammon illustrates the combination of 
learned value functions and decision-time search as in heuristic search and MCTS methods. In follow-on 
work, Tesauro and Galperin (1997) explored trajectory sampling methods as an alternative to full-width 
search, which reduced the error rate of live play by large numerical factors (4x-6x) while keeping the 
think time reasonable at about 5-10 seconds per move. 
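
The selective two-ply search described above can be sketched as follows. This is an illustration under stated assumptions, not Tesauro's implementation: `value(s)` is assumed to estimate the mover's probability of winning from position s, `opponent_replies(s)` is a hypothetical helper returning one (probability, reply positions) pair per opponent dice roll, and the toy numbers are made up.

```python
def two_ply_select(candidates, value, opponent_replies, top_k=5):
    """Choose among the mover's candidate afterstates using a selective second ply."""
    # First ply: keep only the candidates ranked highest by the one-ply estimate.
    shortlist = sorted(candidates, key=value, reverse=True)[:top_k]

    def expected_value(s):
        total = 0.0
        for prob, replies in opponent_replies(s):
            # The opponent is assumed to take the reply that looks best for him,
            # that is, worst for the mover; with no legal reply the position stands.
            total += prob * (min(value(r) for r in replies) if replies else value(s))
        return total

    # Second ply: pick the shortlisted candidate with the best expected value.
    return max(shortlist, key=expected_value)

estimates = {"A": 0.60, "B": 0.55, "A1": 0.2, "A2": 0.7, "B1": 0.5, "B2": 0.6}
replies = {
    "A": [(0.5, ["A1"]), (0.5, ["A2"])],   # two equally likely opponent rolls
    "B": [(0.5, ["B1"]), (0.5, ["B2"])],
}
one_ply = max(["A", "B"], key=estimates.get)
two_ply = two_ply_select(["A", "B"], estimates.get, replies.get)
print(one_ply, two_ply)  # A B
```

In the toy example the candidate that looks best after one ply turns out to be worse once the opponent's replies are averaged over his dice rolls.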


Program          Hidden Units   Training Games   Opponents               Results
---------------------------------------------------------------------------------------------
TD-Gammon 0.0    40             300,000          other programs          tied for best
TD-Gammon 1.0    80             300,000          Robertie, Magriel, ...  −13 pts / 51 games
TD-Gammon 2.0    40             800,000          various Grandmasters    −7 pts / 38 games
TD-Gammon 2.1    80             1,500,000        Robertie                −1 pt / 40 games
TD-Gammon 3.0    80             1,500,000        Kazaros                 +6 pts / 20 games


Table 16.1: Summary of TD-Gammon Results 

During the 1990s, Tesauro was able to play his programs in a significant number of games against 
world-class human players. A summary of the results is given in Table 16.1. Based on these results and 
analyses by backgammon grandmasters (Robertie, 1992; see Tesauro, 1995), TD-Gammon 3.0 appeared 
to play at close to, or possibly better than, the playing strength of the best human players in the 
world. Tesauro reported in a subsequent article (Tesauro, 2002) the results of an extensive rollout 
analysis of the move decisions and doubling decisions of TD-Gammon relative to top human players. 
The conclusion was that TD-Gammon 3.1 had a “lopsided advantage” in piece-movement decisions, 
and a “slight edge” in doubling decisions, over top humans. 

TD-Gammon had a significant impact on the way the best human players play the game. For example, 
it learned to play certain opening positions differently than was the convention among the best human 
players. Based on TD-Gammon’s success and further analysis, the best human players now play these 
positions as TD-Gammon does (Tesauro, 1995). The impact on human play was greatly accelerated 
when several other self-teaching neural net backgammon programs inspired by TD-Gammon, such 
as Jellyfish, Snowie, and GNUBackgammon, became widely available. These programs enabled wide 
dissemination of new knowledge generated by the neural nets, resulting in great improvements in the 
overall caliber of human tournament play (Tesauro, 2002). 


16.2 Samuel’s Checkers Player 

An important precursor to Tesauro’s TD-Gammon was the seminal work of Arthur Samuel (1959, 
1967) in constructing programs for learning to play checkers. Samuel was one of the first to make 
effective use of heuristic search methods and of what we would now call temporal-difference learning. 
His checkers players are instructive case studies in addition to being of historical interest. We emphasize 
the relationship of Samuel’s methods to modern reinforcement learning methods and try to convey some 
of Samuel’s motivation for using them. 

Samuel first wrote a checkers-playing program for the IBM 701 in 1952. His first learning program 
was completed in 1955 and was demonstrated on television in 1956. Later versions of the program 
achieved good, though not expert, playing skill. Samuel was attracted to game-playing as a domain for 
studying machine learning because games are less complicated than problems “taken from life” while 
still allowing fruitful study of how heuristic procedures and learning can be used together. He chose to 
study checkers instead of chess because its relative simplicity made it possible to focus more strongly 
on learning. 

Samuel’s programs played by performing a lookahead search from each current position. They used 
what we now call heuristic search methods to determine how to expand the search tree and when to stop 
searching. The terminal board positions of each search were evaluated, or “scored,” by a value function, 
or “scoring polynomial,” using linear function approximation. In this and other respects Samuel’s work 
seems to have been inspired by the suggestions of Shannon (1950). In particular, Samuel’s program 
was based on Shannon’s minimax procedure to find the best move from the current position. Working 
backward through the search tree from the scored terminal positions, each position was given the score 
of the position that would result from the best move, assuming that the machine would always try 
to maximize the score, while the opponent would always try to minimize it. Samuel called this the 
“backed-up score” of the position. When the minimax procedure reached the search tree’s root—the 
current position—it yielded the best move under the assumption that the opponent would be using 
the same evaluation criterion, shifted to its point of view. Some versions of Samuel’s programs used 
sophisticated search control methods analogous to what are known as “alpha-beta” cutoffs (e.g., see 
Pearl, 1984). 
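
The backing-up of scores can be illustrated on a toy search tree. In this sketch a node is either a list of child nodes or a terminal position of the search, and the scoring function is simply the identity on made-up numbers; none of this is Samuel's code, and alpha-beta cutoffs are omitted.

```python
def backed_up_score(node, score, maximizing):
    """Minimax value of a node: the machine maximizes, the opponent minimizes."""
    if not isinstance(node, list):           # terminal position of the search
        return score(node)
    child_scores = [backed_up_score(c, score, not maximizing) for c in node]
    return max(child_scores) if maximizing else min(child_scores)

def best_move(root_children, score):
    """Index of the machine's best move and the backed-up score of each move."""
    # After the machine moves, the opponent is on move, so each child minimizes.
    values = [backed_up_score(c, score, maximizing=False) for c in root_children]
    return max(range(len(values)), key=values.__getitem__), values

# Three candidate moves, each followed by two opponent replies scored 3, 5, and so on.
tree = [[3, 5], [2, 9], [4, 6]]
move, values = best_move(tree, score=lambda leaf: leaf)
print(move, values)  # 2 [3, 2, 4]
```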

Samuel used two main learning methods, the simpler of which he called rote learning. It consisted 
simply of saving a description of each board position encountered during play together with its backed- 
up value determined by the minimax procedure. The result was that if a position that had already 
been encountered were to occur again as a terminal position of a search tree, the depth of the search 
was effectively amplified since this position’s stored value cached the results of one or more searches 
conducted earlier. One initial problem was that the program was not encouraged to move along the 
most direct path to a win. Samuel gave it a “sense of direction” by decreasing a position’s value a 
small amount each time it was backed up a level (called a ply) during the minimax analysis. “If the 
program is now faced with a choice of board positions whose scores differ only by the ply number, it 
will automatically make the most advantageous choice, choosing a low-ply alternative if winning and 
a high-ply alternative if losing” (Samuel, 1959, p. 80). Samuel found this discounting-like technique 
essential to successful learning. Rote learning produced slow but continuous improvement that was 
most effective for opening and endgame play. His program became a “better-than-average novice” after 
learning from many games against itself and a variety of human opponents, and from book games in a 
supervised learning mode. 
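
How stored values amplify search depth can be seen in a toy example. The sketch below is our illustration of the idea, not Samuel's program: the tree, the static scores, and the function name are made up, and the "sense of direction" adjustment is left out.

```python
def rote_search(pos, tree, static_score, memory, depth, maximizing=True):
    """Depth-limited minimax that reuses remembered backed-up values at the frontier."""
    children = tree.get(pos, [])
    if depth == 0 or not children:
        # A remembered value stands in for the deeper search that produced it.
        return memory.get(pos, static_score(pos))
    values = [rote_search(c, tree, static_score, memory, depth - 1, not maximizing)
              for c in children]
    return max(values) if maximizing else min(values)

tree = {"a": ["b", "c"], "b": ["d", "e"]}               # machine moves at "a"
static = {"a": 0.0, "b": 0.5, "c": 0.3, "d": -1.0, "e": 0.2}
memory = {}

# With nothing remembered, a one-ply search from "a" trusts the static score of "b".
before = rote_search("a", tree, static.get, memory, depth=1)

# Position "b" is later met in play (opponent to move) and its backed-up value is saved.
memory["b"] = rote_search("b", tree, static.get, memory, depth=1, maximizing=False)

# The same one-ply search from "a" now sees two plies deep along the "b" branch.
after = rote_search("a", tree, static.get, memory, depth=1)
print(before, memory["b"], after)  # 0.5 -1.0 0.3
```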

Rote learning and other aspects of Samuel’s work strongly suggest the essential idea of temporal- 
difference learning—that the value of a state should equal the value of likely following states. Samuel 
came closest to this idea in his second learning method, his “learning by generalization” procedure for 
modifying the parameters of the value function. Samuel’s method was the same in concept as that 
used much later by Tesauro in TD-Gammon. He played his program many games against another 
version of itself and performed an update after each move. The idea of Samuel’s update is suggested 
by the diagram in Figure 16.3. Each open circle represents a position where the program moves next, 
an on-move position, and each solid circle represents a position where the opponent moves next. An 
update was made to the value of each on-move position after a move by each side, resulting in a second 
on-move position. The update was toward the minimax value of a search launched from the second 
on-move position. Thus, the overall effect was that of a backing-up over one full move of real events 
and then a search over possible events, as suggested by Figure 16.3. Samuel’s actual algorithm was 
significantly more complex than this for computational reasons, but this was the basic idea. 

Samuel did not include explicit rewards. Instead, he fixed the weight of the most important feature, 
the piece advantage feature, which measured the number of pieces the program had relative to how 
many its opponent had, giving higher weight to kings, and including refinements so that it was better 
to trade pieces when winning than when losing. Thus, the goal of Samuel’s program was to improve its 
piece advantage, which in checkers is highly correlated with winning. 
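
The basic idea of the update, with the piece-advantage weight held fixed, can be sketched for a linear value function. The gradient-style correction used here is our simplification (as noted above, Samuel's actual algorithm was considerably more complex), and the feature values, step size, and names are hypothetical.

```python
def linear_value(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def generalization_update(w, x_on_move, backed_up_target, alpha=0.1, fixed=(0,)):
    """Move the value of an on-move position toward the backed-up score of a search
    launched from the next on-move position. Weights whose indices are listed in
    `fixed` (here index 0, standing for piece advantage) are never modified."""
    error = backed_up_target - linear_value(w, x_on_move)
    return [wi if i in fixed else wi + alpha * error * xi
            for i, (wi, xi) in enumerate(zip(w, x_on_move))]

w = [1.0, 0.0, 0.0]             # weight 0 is the fixed piece-advantage weight
x = [0.5, 1.0, -1.0]            # made-up features of an on-move position
target = 1.0                    # made-up backed-up score from the later search
w_new = generalization_update(w, x, target)
print(linear_value(w, x), linear_value(w_new, x))
```

Because no reward appears anywhere in the update, the only thing anchoring the learned values is the fixed weight.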

However, Samuel’s learning method may have been missing an essential part of a sound temporal- 
difference algorithm. Temporal-difference learning can be viewed as a way of making a value function 
consistent with itself, and this we can clearly see in Samuel’s method. But also needed is a way of tying 
the value function to the true value of the states. We have enforced this via rewards and by discounting 
or giving a fixed value to the terminal state. But Samuel’s method included no rewards and no special 
treatment of the terminal positions of games. As Samuel himself pointed out, his value function could 
have become consistent merely by giving a constant value to all positions. He hoped to discourage such 
solutions by giving his piece-advantage term a large, nonmodifiable weight. But although this may 
decrease the likelihood of finding useless evaluation functions, it does not prohibit them. For example, 
a constant function could still be attained by setting the modifiable weights so as to cancel the effect 
of the nonmodifiable one. 

Since Samuel’s learning procedure was not constrained to find useful evaluation functions, it should 
have been possible for it to become worse with experience. In fact, Samuel reported observing this during 
extensive self-play training sessions. To get the program improving again, Samuel had to intervene and 
set the weight with the largest absolute value back to zero. His interpretation was that this drastic 
intervention jarred the program out of local optima, but another possibility is that it jarred the program 
out of evaluation functions that were consistent but had little to do with winning or losing the game. 

Despite these potential problems, Samuel’s checkers player using the generalization learning method 
approached “better-than-average” play. Fairly good amateur opponents characterized it as “tricky but 
beatable” (Samuel, 1959). In contrast to the rote-learning version, this version was able to develop 
a good middle game but remained weak in opening and endgame play. This program also included 
an ability to search through sets of features to find those that were most useful in forming the value 
function. A later version (Samuel, 1967) included refinements in its search procedure, such as alpha-beta 
pruning, extensive use of a supervised learning mode called “book learning,” and hierarchical lookup 
tables called signature tables (Griffith, 1966) to represent the value function instead of linear function 
approximation. This version learned to play much better than the 1959 program, though still not at a 
master level. Samuel’s checkers-playing program was widely recognized as a significant achievement in 
artificial intelligence and machine learning. 


16.3 Watson’s Daily-Double Wagering 

IBM Watson¹ is the system developed by a team of IBM researchers to play the popular TV quiz 
show Jeopardy!.² It gained fame in 2011 by winning first prize in an exhibition match against human 
champions. Although the main technical achievement demonstrated by Watson was its ability to 
quickly and accurately answer natural language questions over broad areas of general knowledge, its 
winning Jeopardy! performance also relied on sophisticated decision-making strategies for critical parts 
of the game. Tesauro, Gondek, Lechner, Fan, and Prager (2012, 2013) adapted Tesauro’s TD-Gammon 
system described above to create the strategy used by Watson in “Daily-Double” (DD) wagering in its 
celebrated winning performance against human champions. These authors report that the effectiveness 
of this wagering strategy went well beyond what human players are able to do in live game play, and 
that it, along with other advanced strategies, was an important contributor to Watson’s impressive 
winning performance. Here we focus only on DD wagering because it is the component of Watson 
that owes the most to reinforcement learning. 

¹ Registered trademark of IBM Corp. 

² Registered trademark of Jeopardy Productions Inc. 

Jeopardy! is played by three contestants who face a board showing 30 squares, each of which hides a 
clue and has a dollar value. The squares are arranged in six columns, each corresponding to a different 
category. A contestant selects a square, the host reads the square’s clue, and each contestant may 
choose to respond to the clue by sounding a buzzer (“buzzing in”). The first contestant to buzz in gets 
to try responding to the clue. If this contestant’s response is correct, their score increases by the dollar 
value of the square; if their response is not correct, or if they do not respond within five seconds, their 
score decreases by that amount, and the other contestants get a chance to buzz in to respond to the 
same clue. One or two squares (depending on the game’s current round) are special DD squares. A 
contestant who selects one of these gets an exclusive opportunity to respond to the square’s clue and has 
to decide—before the clue is revealed—on how much to wager, or bet. The bet has to be greater than 
five dollars but not greater than the contestant’s current score. If the contestant responds correctly to 
the DD clue, their score increases by the bet amount; otherwise it decreases by the bet amount. At 
the end of each game is a “Final Jeopardy” (FJ) round in which each contestant writes down a sealed 
bet and then writes an answer after the clue is read. The contestant with the highest score after three 
rounds of play (where a round consists of revealing all 30 clues) is the winner. The game has many 
other details, but these are enough to appreciate the importance of DD wagering. Winning or losing 
often depends on a contestant’s DD wagering strategy. 

Whenever Watson selected a DD square, it chose its bet by comparing action values, q(s, bet), that 
estimated the probability of a win from the current game state, s, for each round-dollar legal bet. Except 
for some risk-abatement measures described below, Watson selected the bet with the maximum action 
value. Action values were computed whenever a betting decision was needed by using two types of 
estimates that were learned before any live game play took place. The first were estimated values of the 
afterstates (Section 6.8) that would result from selecting each legal bet. These estimates were obtained 
from a state-value function, $\hat{v}(\cdot,\mathbf{w})$, defined by parameters $\mathbf{w}$, that gave estimates of the probability 
of a win for Watson from any game state. The second estimates used to compute action values gave 
the “in-category DD confidence,” $p_{DD}$, which estimated the likelihood that Watson would respond 
correctly to the as-yet unrevealed DD clue. 

Tesauro et al. used the reinforcement learning approach of TD-Gammon described above to learn 
$\hat{v}(\cdot,\mathbf{w})$: a straightforward combination of nonlinear TD($\lambda$) using a multilayer neural network with 
weights $\mathbf{w}$ trained by backpropagating TD errors during many simulated games. States were represented 
to the network by feature vectors specifically designed for Jeopardy!. Features included the current 
scores of the three players, how many DDs remained, the total dollar value of the remaining clues, 
and other information related to the amount of play left in the game. Unlike TD-Gammon, which 
learned by self-play, Watson’s $\hat{v}$ was learned over millions of simulated games against carefully crafted 
models of human players. In-category confidence estimates were conditioned on the number of right 
responses r and wrong responses w that Watson gave in previously played clues in the current category. 
The dependencies on (r, w) were estimated from Watson’s actual accuracies over many thousands of 
historical categories. 

With the previously learned value function $\hat{v}$ and in-category DD confidence $p_{DD}$, Watson computed 
$q(s, \mathit{bet})$ for each legal round-dollar bet as follows: 

$$q(s, \mathit{bet}) = p_{DD} \times \hat{v}(S_W + \mathit{bet}, \ldots) + (1 - p_{DD}) \times \hat{v}(S_W - \mathit{bet}, \ldots), \tag{16.2}$$
where $S_W$ is Watson’s current score, and $\hat{v}$ gives the estimated value for the game state after Watson’s 
response to the DD clue, which is either correct or incorrect. Computing an action value this way 
corresponds to the insight from Exercise 3.16 that an action value is the expected next state value given 
the action (except that here it is the expected next afterstate value because the full next state of the 
entire game depends on the next square selection). 
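
Equation (16.2) translates directly into code. In the sketch below the value function is a made-up stand-in that depends only on Watson's score (the real function took the full feature vector), the grid of bets is coarser than the round-dollar bets mentioned above, and all names and numbers are hypothetical.

```python
import math

def action_value(score, bet, p_dd, v_hat):
    """Equation (16.2): expected afterstate value of a bet, with v_hat(score)
    standing in for the value of the afterstate reached with that score."""
    return p_dd * v_hat(score + bet) + (1.0 - p_dd) * v_hat(score - bet)

def best_bet(score, legal_bets, p_dd, v_hat):
    """Risk-neutral choice: the legal bet with the maximum action value."""
    return max(legal_bets, key=lambda b: action_value(score, b, p_dd, v_hat))

def toy_v_hat(score):
    # Hypothetical win-probability estimate that rises smoothly with the score.
    return 1.0 / (1.0 + math.exp(-(score - 10000) / 3000.0))

bets = range(100, 8001, 100)      # illustrative grid of bets up to the current score
bet = best_bet(8000, bets, p_dd=0.75, v_hat=toy_v_hat)
print(bet, action_value(8000, bet, 0.75, toy_v_hat))
```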

Tesauro et al. found that selecting bets by maximizing action values incurred “a frightening amount of 
risk,” meaning that if Watson’s response to the clue happened to be wrong, the loss could be disastrous 
for its chances of winning. To decrease the downside risk of a wrong answer, Tesauro et al. adjusted 
(16.2) by subtracting a small fraction of the standard deviation over Watson’s correct/incorrect afterstate 
evaluations. They further reduced risk by prohibiting bets that would cause the wrong-answer 
afterstate value to decrease below a certain limit. These measures slightly reduced Watson’s expectation 
of winning, but they significantly reduced downside risk, not only in terms of average risk per DD 
bet, but even more so in extreme-risk scenarios where a risk-neutral Watson would bet most or all of 
its bankroll. 
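
The two risk-abatement measures can be sketched as follows. The fraction of the standard deviation (`risk_weight`) and the lower limit on the wrong-answer value (`floor`) are hypothetical numbers, since the actual settings are not given here, and the linear value function is a made-up stand-in that depends only on Watson's score.

```python
import math

def risk_adjusted_value(score, bet, p_dd, v_hat, risk_weight=0.2):
    """Action value (16.2) minus a fraction of the standard deviation of the
    correct/incorrect afterstate evaluations."""
    v_right, v_wrong = v_hat(score + bet), v_hat(score - bet)
    mean = p_dd * v_right + (1.0 - p_dd) * v_wrong
    # Standard deviation of a two-outcome random variable.
    std = math.sqrt(p_dd * (1.0 - p_dd)) * abs(v_right - v_wrong)
    return mean - risk_weight * std

def cautious_bet(score, legal_bets, p_dd, v_hat, risk_weight=0.2, floor=0.3):
    """Best risk-adjusted bet among those whose wrong-answer value stays above floor."""
    allowed = [b for b in legal_bets if v_hat(score - b) >= floor]
    if not allowed:                        # fall back to the smallest legal bet
        allowed = [min(legal_bets)]
    return max(allowed,
               key=lambda b: risk_adjusted_value(score, b, p_dd, v_hat, risk_weight))

toy_v_hat = lambda s: s / 20000.0          # hypothetical value function
bets = range(100, 8001, 100)               # illustrative grid of bets
bet = cautious_bet(8000, bets, p_dd=0.75, v_hat=toy_v_hat)
print(bet)  # 2000
```

With these made-up numbers a risk-neutral Watson would bet its whole score, whereas the floor on the wrong-answer value caps the bet well below that.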

Why was the TD-Gammon method of self-play not used to learn the critical value function $\hat{v}$? 
Learning from self-play in Jeopardy! would not have worked very well because Watson was so different 
from any human contestant. Self-play would have led to exploration of state space regions that are 
not typical for play against human opponents, particularly human champions. In addition, unlike 
backgammon, Jeopardy! is a game of imperfect information because contestants do not have access to 
all the information influencing their opponents’ play. In particular, Jeopardy! contestants do not know 
how much confidence their opponents have for responding to clues in the various categories. Self-play 
would have been something like playing poker with someone who is holding the same cards that you 
hold. 

As a result of these complications, much of the effort in developing Watson’s DD-wagering strategy 
was devoted to creating good models of human opponents. The models did not address the natural 
language aspect of the game, but were instead stochastic process models of events that can occur during 
play. Statistics were extracted from an extensive fan-created archive of game information from the 
beginning of the show to the present day. The archive includes information such as the ordering of the 
clues, right and wrong contestant answers, DD locations, and DD and FJ bets for nearly 300,000 clues. 
Three models were constructed: an Average Contestant model (based on all the data), a Champion 
model (based on statistics from games with the 100 best players), and a Grand Champion model (based 
on statistics from games with the 10 best players). In addition to serving as opponents during learning, 
the models were used to assess the benefits produced by the learned DD-wagering strategy. Watson’s 
win rate in simulation when it used a baseline heuristic DD-wagering strategy was 61%; when it used the 
learned values and a default confidence value, its win rate increased to 64%; and with live in-category 
confidence, it was 67%. Tesauro et al. regarded this as a significant improvement, given that the DD 
wagering was needed only about 1.5 to 2 times in each game. 

Because Watson had only a few seconds to bet, as well as to select squares and decide whether or 
not to buzz in, the computation time needed to make these decisions was a critical factor. The neural 
network implementation of $\hat{v}$ allowed DD bets to be made quickly enough to meet the time constraints of 
live play. However, once games could be simulated fast enough through improvements in the simulation 
software, near the end of a game it was feasible to estimate the value of bets by averaging over many 
Monte-Carlo trials in which the consequence of each bet was determined by simulating play to the 
game’s end. Selecting endgame DD bets in live play based on Monte-Carlo trials instead of the neural 
network significantly improved Watson’s performance because errors in value estimates in endgames 
could seriously affect its chances of winning. Making all the decisions via Monte-Carlo trials might have 
led to better wagering decisions, but this was simply impossible given the complexity of the game and 
the time constraints of live play. 
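
The endgame procedure amounts to estimating each bet's value as the fraction of simulated games won. The following is a minimal sketch under that reading; `simulate_to_end` and `toy_simulator` are hypothetical stand-ins for Watson's game simulator, which is not described in detail here.

```python
import random

def mc_bet_value(bet, simulate_to_end, n_trials=1000, seed=0):
    """Estimate a bet's value as the fraction of simulated games won."""
    rng = random.Random(seed)
    wins = sum(1 for _ in range(n_trials) if simulate_to_end(bet, rng))
    return wins / n_trials

def best_bet_by_rollouts(bets, simulate_to_end, n_trials=1000):
    """Choose the bet with the highest Monte-Carlo value estimate."""
    return max(bets, key=lambda b: mc_bet_value(b, simulate_to_end, n_trials))

def toy_simulator(bet, rng):
    """Hypothetical simulator: larger bets win more often in this toy setting."""
    win_probability = 0.7 if bet >= 1000 else 0.3
    return rng.random() < win_probability
```

Using the same seed for every bet means all bets are compared on the same random outcomes, which reduces the noise in the comparison. That is a design choice of this sketch, not a reported detail of Watson.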

Although its ability to quickly and accurately answer natural language questions stands out as 
Watson’s major achievement, all of its sophisticated decision strategies contributed to its impressive 
defeat of human champions. According to Tesauro et al. (2012): 

... it is plainly evident that our strategy algorithms achieve a level of quantitative precision 
and real-time performance that exceeds human capabilities. This is particularly true in 
the cases of DD wagering and endgame buzzing, where humans simply cannot come close 
to matching the precise equity and confidence estimates and complex decision calculations 
performed by Watson. 


16.4 Optimizing Memory Control 

Most computers use dynamic random access memory (DRAM) as their main memory because of its 
low cost and high capacity. The job of a DRAM memory controller is to efficiently use the interface 
between the processor chip and an off-chip DRAM system to provide the high-bandwidth and 
low-latency data transfer necessary for high-speed program execution. A memory controller needs to deal 
with dynamically changing patterns of read/write requests while adhering to a large number of timing 
and resource constraints required by the hardware. This is a formidable scheduling problem, especially 
with modern processors with multiple cores sharing the same DRAM. 

Ipek, Mutlu, Martinez, and Caruana (2008) (also Martinez and Ipek, 2009) designed a reinforcement 
learning memory controller and demonstrated that it can significantly improve the speed of program 
execution over what was possible with conventional controllers at the time of their research. They 
were motivated by limitations of existing state-of-the-art controllers that used policies that did not take 
advantage of past scheduling experience and did not account for long-term consequences of scheduling 
decisions. Ipek et al.’s project was carried out by means of simulation, but they designed the controller 
at the detailed level of the hardware needed to implement it—including the learning algorithm—directly 
on a processor chip. 

Accessing DRAM involves a number of steps that have to be done according to strict time constraints. 
DRAM systems consist of multiple DRAM chips, each containing multiple rectangular arrays of storage 
cells arranged in rows and columns. Each cell stores a bit as the charge on a capacitor. Since the 
charge decreases over time, each DRAM cell needs to be recharged, or refreshed, every few milliseconds 
to prevent memory content from being lost. This need to refresh the cells is why DRAM is called 
“dynamic.” 

Each cell array has a row buffer that holds a row of bits that can be transferred into or out of one of 
the array’s rows. An activate command “opens a row,” which means moving the contents of the row 
whose address is indicated by the command into the row buffer. With a row open, the controller can 
issue read and write commands to the cell array. Each read command transfers a word (a short sequence 
of consecutive bits) in the row buffer to the external data bus, and each write command transfers a 
word in the external data bus to the row buffer. Before a different row can be opened, a precharge 
command must be issued which transfers the (possibly updated) data in the row buffer back into the 
addressed row of the cell array. After this, another activate command can open a new row to be accessed. 
Read and write commands are column commands because they sequentially transfer bits into or out of 
columns of the row buffer; multiple bits can be transferred without re-opening the row. Read and write 
commands to the currently-open row can be carried out more quickly than accessing a different row, 
which would involve additional row commands: precharge and activate; this is sometimes referred to as 
“row locality.” A memory controller maintains a memory transaction queue that stores memory-access 
requests from the processors sharing the memory system. The controller has to process requests by 
issuing commands to the memory system while adhering to a large number of timing constraints. 
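
The command sequence described above, and the benefit of row locality, can be illustrated with a toy model of a single cell array. This sketch ignores all timing constraints and simply counts commands; the class and function names are hypothetical.

```python
class DramBank:
    """Toy model of one cell array with a row buffer (timing is ignored)."""

    def __init__(self, n_rows, n_cols):
        self.cells = [[0] * n_cols for _ in range(n_rows)]
        self.open_row = None
        self.row_buffer = None
        self.commands_issued = 0

    def activate(self, row):
        assert self.open_row is None, "precharge before opening another row"
        self.row_buffer = list(self.cells[row])  # copy the row into the buffer
        self.open_row = row
        self.commands_issued += 1

    def precharge(self):
        if self.open_row is not None:
            # write the (possibly updated) buffer back into the addressed row
            self.cells[self.open_row] = list(self.row_buffer)
        self.open_row, self.row_buffer = None, None
        self.commands_issued += 1

    def read(self, col):
        assert self.open_row is not None, "no open row"
        self.commands_issued += 1
        return self.row_buffer[col]

    def write(self, col, value):
        assert self.open_row is not None, "no open row"
        self.commands_issued += 1
        self.row_buffer[col] = value

def access(bank, row, col, value=None):
    """Issue the commands needed to read (value is None) or write (row, col)."""
    if bank.open_row != row:
        if bank.open_row is not None:
            bank.precharge()
        bank.activate(row)
    if value is None:
        return bank.read(col)
    bank.write(col, value)
```

In this model an access to the open row costs one command, while an access to a different row costs three (precharge, activate, then the column command).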

A controller’s policy for scheduling access requests can have a large effect on the performance of the 
memory system, such as the average latency with which requests can be satisfied and the throughput 
the system is capable of achieving. The simplest scheduling strategy handles access requests in the order 
in which they arrive by issuing all the commands required by the request before beginning to service 
the next one. But if the system is not ready for one of these commands, or executing a command 
would result in resources being underutilized (e.g., due to timing constraints arising from servicing 
that one command), it makes sense to begin servicing a newer request before finishing the older one. 
Policies can gain efficiency by reordering requests, for example, by giving priority to read requests over 
write requests, or by giving priority to read/write commands to already open rows. The policy called 
First-Ready, First-Come-First-Serve (FR-FCFS), gives priority to column commands (read and write) 
over row commands (activate and precharge), and in case of a tie gives priority to the oldest command. 
FR-FCFS was shown to outperform other scheduling policies in terms of average memory-access latency 
under conditions commonly encountered (Rixner, 2004). 
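
The FR-FCFS priority rule can be written as a single sort key. The sketch below assumes each pending command carries its kind, its arrival time, and a flag saying whether it can be issued now without violating constraints; the `Command` record and its field names are hypothetical.

```python
from collections import namedtuple

# kind is one of "read", "write", "activate", "precharge"
Command = namedtuple("Command", ["kind", "arrival", "ready"])

COLUMN_COMMANDS = {"read", "write"}

def fr_fcfs_pick(pending):
    """Return the next command under FR-FCFS, or None if nothing is ready.

    Column commands beat row commands; ties go to the oldest command.
    """
    ready = [c for c in pending if c.ready]
    if not ready:
        return None
    # False sorts before True, so column commands come first
    return min(ready, key=lambda c: (c.kind not in COLUMN_COMMANDS, c.arrival))

queue = [
    Command("activate", arrival=1, ready=True),
    Command("read", arrival=5, ready=True),
    Command("write", arrival=3, ready=True),
    Command("read", arrival=0, ready=False),
]
chosen = fr_fcfs_pick(queue)
```

Here the write that arrived at time 3 is chosen: it is the oldest ready column command, even though an older row command is also ready.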

Figure 16.4 is a high-level view of Ipek et al.’s reinforcement learning memory controller. They 
modeled the DRAM access process as an MDP whose states are the contents of the transaction queue 
and whose actions are commands to the DRAM system: precharge, activate, read, write, and NoOp. 
The reward signal is 1 whenever the action is read or write, and otherwise it is 0. State transitions were 
considered to be stochastic because the next state of the system not only depends on the scheduler’s 
command, but also on aspects of the system’s behavior that the scheduler cannot control, such as the 
workloads of the processor cores accessing the DRAM system. 

Critical to this MDP are constraints on the actions available in each state. Recall from Chapter 3 
that the set of available actions can depend on the state: $A_t \in \mathcal{A}(S_t)$, where $A_t$ is the action at time 
step $t$ and $\mathcal{A}(S_t)$ is the set of actions available in state $S_t$. In this application, the integrity of the 
DRAM system was assured by not allowing actions that would violate timing or resource constraints. 
Although Ipek et al. did not make it explicit, they effectively accomplished this by pre-defining the sets 
$\mathcal{A}(S_t)$ for all possible states $S_t$. 

These constraints explain why the MDP has a NoOp action and why the reward signal is 0 except 
when a read or write command is issued. NoOp is issued when it is the sole legal action in a state. 
To maximize utilization of the memory system, the controller’s task is to drive the system to states in 
which either a read or a write action can be selected: only these actions result in sending data over 
the external data bus, so it is only these that contribute to the throughput of the system. Although 
precharge and activate produce no immediate reward, the agent needs to select these actions to make 
it possible to later select the rewarded read and write actions. 
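
The reward signal and the state-dependent action sets can be sketched as follows. The dictionary `timing_ok`, which says whether each command currently satisfies the hardware constraints, is a hypothetical simplification of the many timing and resource checks a real controller performs.

```python
DRAM_COMMANDS = ("precharge", "activate", "read", "write")

def reward(action):
    """Reward is 1 only for actions that move data over the external data bus."""
    return 1 if action in ("read", "write") else 0

def available_actions(timing_ok):
    """Return the legal actions; NoOp is used only when nothing else is legal."""
    legal = [a for a in DRAM_COMMANDS if timing_ok.get(a, False)]
    return legal or ["noop"]
```

Because precharge and activate earn no immediate reward, an agent that maximizes only the immediate reward would never open a row; it is the long-term return that makes these actions worth selecting.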

Figure 16.4: High-level view of the reinforcement learning DRAM controller. The scheduler is the reinforcement 
learning agent. Its environment is represented by features of the transaction queue, and its actions are commands 
to the DRAM system. ©2009 IEEE. Reprinted, with permission, from J. F. Martinez and E. Ipek, Dynamic 
multicore resource management: A machine learning approach, Micro, IEEE, 29(5), p. 12. 

The scheduling agent used Sarsa (Section 6.4) to learn an action-value function. States were 
represented by six integer-valued features. To approximate the action-value function, the algorithm used 
linear function approximation implemented by tile coding with hashing (Section 9.5.4). The tile coding 
had 32 tilings, each storing 256 action values as 16-bit fixed point numbers. Exploration was $\varepsilon$-greedy 
with $\varepsilon = 0.05$. 
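
A software sketch of Sarsa with hashed tile coding is given below. It keeps the 32 tilings of 256 entries and the exploration rate of 0.05 from the description above, but everything else is an assumption made for illustration: the tile width, the hash function, the step size, the discount rate, and the use of floating point rather than 16-bit fixed point values.

```python
import random

N_TILINGS, TABLE_SIZE = 32, 256   # sizes quoted in the text
TILE_WIDTH = 4                    # hypothetical tile resolution

class HashedTileQ:
    """Linear action-value function over hashed tile-coding features."""

    def __init__(self):
        self.w = [[0.0] * TABLE_SIZE for _ in range(N_TILINGS)]

    def _indices(self, features, action):
        for t in range(N_TILINGS):
            # each tiling is shifted by a different offset before quantizing
            tile = tuple((f * N_TILINGS + t * TILE_WIDTH) // (TILE_WIDTH * N_TILINGS)
                         for f in features)
            yield t, hash((t, action) + tile) % TABLE_SIZE

    def value(self, features, action):
        return sum(self.w[t][i] for t, i in self._indices(features, action))

    def sarsa_update(self, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
        """One Sarsa step; alpha and gamma are illustrative values."""
        delta = r + gamma * self.value(s_next, a_next) - self.value(s, a)
        for t, i in self._indices(s, a):
            self.w[t][i] += (alpha / N_TILINGS) * delta
        return delta

def epsilon_greedy(q, features, legal_actions, epsilon=0.05, rng=random):
    """Choose among the legal actions only, exploring with probability epsilon."""
    if rng.random() < epsilon:
        return rng.choice(legal_actions)
    return max(legal_actions, key=lambda a: q.value(features, a))
```

Actions are encoded as small integers here so that the hash of each (tiling, action, tile) tuple is reproducible from run to run.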

State features included the number of read requests in the transaction queue, the number of write 
requests in the transaction queue, the number of write requests in the transaction queue waiting for 
their row to be opened, and the number of read requests in the transaction queue waiting for their row 
to be opened that are the oldest issued by their requesting processors. (The other features depended on 
how the DRAM interacts with cache memory, details we omit here.) The selection of the state features 
was based on Ipek et al.’s understanding of factors that impact DRAM performance. For example, 
balancing the rate of servicing reads and writes based on how many of each are in the transaction 
queue can help avoid stalling the DRAM system’s interaction with cache memory. The authors in fact 
generated a relatively long list of potential features, and then pared them down to a handful using 
simulations guided by stepwise feature selection. 

An interesting aspect of this formulation of the scheduling problem as an MDP is that the features 
input to the tile coding for defining the action-value function were different from the features used to 
specify the action-constraint sets $\mathcal{A}(S_t)$. Whereas the tile coding input was derived from the contents 
of the transaction queue, the constraint sets depended on a host of other features related to timing and 
resource constraints that had to be satisfied by the hardware implementation of the entire system. In 
this way, the action constraints ensured that the learning algorithm’s exploration could not endanger 
the integrity of the physical system, while learning was effectively limited to a “safe” region of the much 
larger state space of the hardware implementation. 

Since an objective of this work was that the learning controller could be implemented on a chip so 
that learning could occur on-line while a computer is running, hardware implementation details were 
important considerations. The design included two five-stage pipelines to calculate and compare two 
action values at every processor clock cycle, and to update the appropriate action value. This included 
accessing the tile coding which was stored on-chip in static RAM. For the configuration Ipek et al. 
simulated, which was a 4GHz 4-core chip typical of high-end workstations at the time of their research, 
there were 10 processor cycles for every DRAM cycle. Considering the cycles needed to fill the pipes, 
up to 12 actions could be evaluated in each DRAM cycle. Ipek et al. found that the number of legal 
commands for any state was rarely greater than this, and that performance loss was negligible if enough 
time was not always available to consider all legal commands. These and other clever design details 
made it feasible to implement the complete controller and learning algorithm on a multi-processor chip. 
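
One way to arrive at the figure of 12 actions per DRAM cycle is the following accounting. It assumes that each pipeline accepts a new action every processor cycle and that an evaluation counts only once it has passed through all five stages; this reading is an assumption and is not spelled out in the text above.

```python
def actions_per_dram_cycle(cpu_cycles_per_dram_cycle=10, pipelines=2, stages=5):
    """Count evaluations that finish within one DRAM cycle.

    The first result leaves a pipeline after `stages` cycles; after that,
    one more result completes on every remaining cycle.
    """
    per_pipeline = cpu_cycles_per_dram_cycle - stages + 1
    return pipelines * per_pipeline

evaluated = actions_per_dram_cycle()
```

Under these assumptions each pipeline completes 10 - 5 + 1 = 6 evaluations, giving 12 for the two pipelines together.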

Ipek et al. evaluated their learning controller in simulation by comparing it with three other 
controllers: 1) the FR-FCFS controller mentioned above that produces the best on-average performance, 
2) a conventional controller that processes each request in order, and 3) an unrealizable ideal controller, 
called the Optimistic controller, able to sustain 100% DRAM throughput if given enough demand by 
ignoring all timing and resource constraints, but otherwise modeling DRAM latency (as row buffer hits) 
and bandwidth. They simulated nine memory-intensive parallel workloads consisting of scientific and 
data-mining applications. Figure 16.5 shows the performance (the inverse of execution time, 
normalized to the performance of FR-FCFS) of each controller for the nine applications, together with the 
geometric mean of their performances over the applications. The learning controller, labeled RL in the 
figure, outperformed FR-FCFS by 7% to 33% across the nine applications, with an average 
improvement of 19%. Of course, no realizable controller can match the performance of Optimistic, 
which ignores all timing and resource constraints, but the learning controller’s performance closed the 
gap with Optimistic’s upper bound by an impressive 27%. 

Figure 16.5: Performances of four controllers over a suite of 9 simulated benchmark applications. The controllers 
are: the simplest ‘in-order’ controller, FR-FCFS, the learning controller RL, and the unrealizable Optimistic 
controller which ignores all timing and resource constraints to provide a performance upper bound. 
Performance, normalized to that of FR-FCFS, is the inverse of execution time. At far right is the geometric mean of 
performances over the 9 benchmark applications for each controller. Controller RL comes closest to the ideal 
performance. ©2009 IEEE. Reprinted, with permission, from J. F. Martinez and E. Ipek, Dynamic multicore 
resource management: A machine learning approach, Micro, IEEE, 29(5), p. 13. 

Because the rationale for on-chip implementation of the learning algorithm was to allow the 
scheduling policy to adapt on-line to changing workloads, Ipek et al. analyzed the impact of on-line learning 
compared to a previously-learned fixed policy. They trained their controller with data from all nine 
benchmark applications and then held the resulting action values fixed throughout the simulated 
execution of the applications. They found that the average performance of the controller that learned 
on-line was 8% better than that of the controller using the fixed policy, leading them to conclude that 
on-line learning is an important feature of their approach. 

This learning memory controller was never committed to physical hardware because of the large 
cost of fabrication. Nevertheless, Ipek et al. could convincingly argue on the basis of their simulation 
results that a memory controller that learns on-line via reinforcement learning has the potential to 
improve performance to levels that would otherwise require more complex and more expensive memory 
systems, while removing from human designers some of the burden required to manually design efficient 
scheduling policies. Mukundan and Martinez (2012) took this project forward by investigating learning 
controllers with additional actions, other performance criteria, and more complex reward functions 
derived using genetic algorithms. They considered additional performance criteria related to energy 
efficiency. The results of these studies surpassed the earlier results described above and significantly 
surpassed the 2012 state-of-the-art for all of the performance criteria they considered. The approach is 
especially promising for developing sophisticated power-aware DRAM interfaces. 


16.5 Human-level Video Game Play 

One of the greatest challenges in applying reinforcement learning to real-world problems is deciding 
how to represent and store value functions and/or policies. Unless the state set is finite and small 
enough to allow exhaustive representation by a lookup table—as in many of our illustrative examples— 
one must use a parameterized function approximation scheme. Whether linear or non-linear, function 
approximation relies on features that have to be readily accessible to the learning system and able to 
convey the information necessary for skilled performance. Most successful applications of reinforcement 
learning owe much to sets of features carefully handcrafted based on human knowledge and intuition 
about the specific problem to be tackled. 

A team of researchers at Google DeepMind developed an impressive demonstration that a deep 
multi-layer artificial neural network (ANN) can automate the feature design process (Mnih et al., 2013, 2015). 
Multi-layer ANNs have been used for function approximation in reinforcement learning ever since the 
1986 popularization of the backpropagation algorithm as a method for learning internal representations 
(Rumelhart, Hinton, and Williams, 1986; see Section 9.6). Striking results have been obtained by 
coupling reinforcement learning with backpropagation. The results obtained by Tesauro and colleagues 
with TD-Gammon and Watson discussed above are notable examples. These and other applications 
benefited from the ability of multi-layer ANNs to learn task-relevant features. However, in all the 
examples of which we are aware, the most impressive demonstrations required the network’s input 
to be represented in terms of specialized features handcrafted for the given problem. This is vividly 
apparent in the TD-Gammon results. TD-Gammon 0.0, whose network input was essentially a “raw” 
representation of the backgammon board, meaning that it involved very little knowledge of backgammon, 
learned to play approximately as well as the best previous backgammon computer programs. Adding 
specialized backgammon features produced TD-Gammon 1.0 which was substantially better than all 
previous backgammon programs and competed well against human experts. 

Mnih et al. developed a reinforcement learning agent called deep Q-network (DQN) that combined 
Q-learning with a deep convolutional ANN, a many-layered, or deep, ANN specialized for processing 
spatial arrays of data such as images. We describe deep convolutional ANNs in Section 9.6. By the 
time of Mnih et al.’s work with DQN, deep ANNs, including deep convolutional ANNs, had produced 
impressive results in many applications, but they had not been widely used in reinforcement learning. 

Mnih et al. used DQN to show how a single reinforcement learning agent can achieve high levels 
of performance in many different problems without relying on different problem-specific feature sets. 
To demonstrate this, they let DQN learn to play 49 different Atari 2600 video games by interacting 
with a game emulator. For learning each game, DQN used the same raw input, the same network 
architecture, and the same parameter values (e.g., step-size, discount rate, exploration parameters, and 
many more specific to the implementation). DQN achieved levels of play at or beyond human level on 
a large fraction of these games. Although the games were alike in being played by watching streams 
of video images, they varied widely in other respects. Their actions had different effects, they had 
different state-transition dynamics, and they needed different policies for earning high scores. The 
deep convolutional ANN learned to transform the raw input common to all the games into features 
specialized for representing the action values required for playing at the high level DQN achieved for 
most of the games. 

The Atari 2600 is a home video game console that was sold in various versions by Atari Inc. from 
1977 to 1992. It introduced or popularized many arcade video games that are now considered classics, 
such as Pong, Breakout, Space Invaders, and Asteroids. Although much simpler than modern video 
games, Atari 2600 games are still entertaining and challenging for human players, and they have been 
attractive as testbeds for developing and evaluating reinforcement learning methods (Diuk, Cohen, and 
Littman, 2008; Naddaf, 2010; Cobo, Zang, Isbell, and Thomaz, 2011; Bellemare, Veness, and Bowling, 
2013). Bellemare, Naddaf, Veness, and Bowling (2012) developed the publicly available Arcade Learning 
Environment (ALE) to encourage and simplify using Atari 2600 games to study learning and planning 
algorithms. 

These previous studies and the availability of ALE made the Atari 2600 game collection a good choice 
for Mnih et al.’s demonstration, which was also influenced by the impressive human-level performance 
that TD-Gammon was able to achieve in backgammon. DQN is similar to TD-Gammon in using a 
multi-layer ANN as the function approximation method for a semi-gradient form of a TD algorithm, 
with the gradients computed by the backpropagation algorithm. However, instead of using TD($\lambda$) 
as TD-Gammon did, DQN used the semi-gradient form of Q-learning. TD-Gammon estimated the 
values of afterstates, which were easily obtained from the rules for making backgammon moves. To 
use the same algorithm for the Atari games would have required generating the next states for each 
possible action (which would not have been afterstates in that case). This could have been done by 
using the game emulator to run single-step simulations for all the possible actions (which ALE makes 
possible). Or a model of each game’s state-transition function could have been learned and used to 
predict next states (Oh, Guo, Lee, Lewis, and Singh, 2015). While these methods might have produced 
results comparable to DQN’s, they would have been more complicated to implement and would have 
significantly increased the time needed for learning. Another motivation for using Q-learning was that 
DQN used the experience replay method, described below, which requires an off-policy algorithm. Being 
model-free and off-policy made Q-learning a natural choice. 

Before describing the details of DQN and how the experiments were conducted, we look at the skill 
levels DQN was able to achieve. Mnih et al. compared the scores of DQN with the scores of the best 
performing learning system in the literature at the time, the scores of a professional human games 
tester, and the scores of an agent that selected actions at random. The best system from the literature 
used linear function approximation with features hand designed using some knowledge about Atari 2600 
games (Bellemare, Naddaf, Veness, and Bowling, 2013). DQN learned on each game by interacting with 
the game emulator for 50 million frames, which corresponds to about 38 days of experience with the 
game. At the start of learning on each game, the weights of DQN’s network were reset to random values. 
To evaluate DQN’s skill level after learning, its score was averaged over 30 sessions on each game, each 
lasting up to 5 minutes and beginning with a random initial game state. The professional human tester 
played using the same emulator (with the sound turned off to remove any possible advantage over DQN 
which did not process audio). After 2 hours of practice, the human played about 20 episodes of each 
game for up to 5 minutes each and was not allowed to take any break during this time. DQN learned 
to play better than the best previous reinforcement learning systems on all but 6 of the games, and 
played better than the human player on 22 of the games. By considering any performance that scored 
at or above 75% of the human score to be comparable to, or better than, human-level play, Mnih et al. 
concluded that the levels of play DQN learned reached or exceeded human level on 29 of the 49 games. 
See Mnih et al. (2015) for a more detailed account of these results. 

For an artificial learning system to achieve these levels of play would be impressive enough, but what 
makes these results remarkable—and what many at the time considered to be breakthrough results 
for artificial intelligence—is that the very same learning system achieved these levels of play on widely 
varying games without relying on any game-specific modifications. 

A human playing any of these 49 Atari games sees 210 x 160 pixel image frames with 128 colors 
at 60Hz. In principle, exactly these images could have formed the raw input to DQN, but to reduce 
memory and processing requirements, Mnih et al. preprocessed each frame to produce an 84x84 array 
of luminance values. Since the full states of many of the Atari games are not completely observable 
from the image frames, Mnih et al. “stacked” the four most recent frames so that the inputs to the 
network had dimension 84x84x4. This did not eliminate partial observability for all of the games, but 
it was helpful in making many of them more Markovian. 
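
The preprocessing pipeline can be sketched in plain Python. The text above does not say how the frames were rescaled or how luminance was computed, so the nearest-neighbour sampling, the luma weights, and the way the stack is filled at the start of an episode are all assumptions made for illustration.

```python
from collections import deque

def luminance(rgb):
    """Weighted sum of the color channels (common luma weights, an assumption)."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b

def preprocess(frame, out_h=84, out_w=84):
    """Reduce an H x W frame of RGB tuples to an out_h x out_w luminance array.

    Nearest-neighbour sampling is a simple stand-in for proper rescaling.
    """
    h, w = len(frame), len(frame[0])
    return [[luminance(frame[i * h // out_h][j * w // out_w]) for j in range(out_w)]
            for i in range(out_h)]

class FrameStack:
    """Keep the k most recent preprocessed frames as the network input."""

    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def push(self, frame):
        processed = preprocess(frame)
        if not self.frames:
            # at the start of an episode, fill the stack with the first frame
            self.frames.extend([processed] * self.frames.maxlen)
        else:
            self.frames.append(processed)
        return list(self.frames)
```

Pushing a 210×160 frame returns four 84×84 arrays, which together form the 84×84×4 input.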

An essential point here is that these preprocessing steps were exactly the same for all 49 games. No 
game-specific prior knowledge was involved beyond the general understanding that it should still be 
possible to learn good policies with this reduced dimension and that stacking adjacent frames should 
help with the partial observability of some of the games. Since no game-specific prior knowledge beyond 
this minimal amount was used in preprocessing the image frames, we can think of the 84x84x4 input 
vectors as being “raw” input to DQN. 

The basic architecture of DQN is similar to the deep convolutional ANN illustrated in Figure 9.15 
(though unlike that network, subsampling in DQN is treated as part of each convolutional layer, with 
feature maps consisting of units having only a selection of the possible receptive fields). DQN has three 
hidden convolutional layers, followed by one fully connected hidden layer, followed by the output layer. 
The three successive hidden convolutional layers of DQN produce 32 20 x 20 feature maps, 64 9 x 9 
feature maps, and 64 7 x 7 feature maps. The activation function of the units of each feature map is a 
rectifier nonlinearity ($\max(0, x)$). The 3,136 (64×7×7) units in this third convolutional layer all connect 
to each of 512 units in the fully connected hidden layer, which then each connect to all 18 units in the 
output layer, one for each possible action in an Atari game. 
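
The feature-map sizes quoted above follow from the usual formula for a convolution without padding. The filter sizes and strides used below (8×8 with stride 4, 4×4 with stride 2, 3×3 with stride 1) are those reported by Mnih et al. (2015); they are not stated in the text above, so treat them as assumptions of this sketch.

```python
def conv_output_size(size, kernel, stride):
    """Side length of a feature map after an unpadded convolution."""
    return (size - kernel) // stride + 1

# (number of feature maps, filter side, stride) for each convolutional layer
LAYERS = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

def dqn_feature_map_shapes(input_size=84):
    """Return (maps, height, width) for each convolutional layer in turn."""
    shapes, size = [], input_size
    for n_maps, kernel, stride in LAYERS:
        size = conv_output_size(size, kernel, stride)
        shapes.append((n_maps, size, size))
    return shapes

shapes = dqn_feature_map_shapes()
n_units_last_conv = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
```

Starting from an 84×84 input, the sides shrink to 20, 9, and 7, and the last layer has 64 × 7 × 7 = 3,136 units.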

The activation levels of DQN’s output units were the estimated optimal action values (optimal 
Q-values) of the corresponding state-action pairs, for the state represented by the network’s input. The 
assignment of output units to a game’s actions varied from game to game, and since the number of 
valid actions varied between 4 and 18 for the games, not all output units had functional roles in all of 
the games. It helps to think of the network as if it were 18 separate networks, one for estimating the 
optimal action value of each possible action. In reality, these networks shared their initial layers, but 
the output units learned to use the features extracted by these layers in different ways. 

DQN’s reward signal indicated how a game’s score changed from one time step to the next: +1 
whenever it increased, −1 whenever it decreased, and 0 otherwise. This standardized the reward signal 
across the games and made a single step-size parameter work well for all the games despite their varying 
ranges of scores. DQN used an $\varepsilon$-greedy policy, with $\varepsilon$ decreasing linearly over the first million frames 
and remaining at a low value for the rest of the learning session. The values of the various other 
parameters, such as the learning step-size, discount rate, and others specific to the implementation, 
were selected by performing informal searches to see which values worked best for a small selection of 
the games. These values were then held fixed for all of the games. 
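
The reward standardization and the exploration schedule are both simple to state in code. The starting and final exploration rates below (1.0 and 0.1) are assumptions for illustration; the text above says only that the rate decreased linearly over the first million frames and then stayed at a low value.

```python
def clipped_reward(prev_score, new_score):
    """Return +1, -1, or 0 according to the sign of the score change."""
    return (new_score > prev_score) - (new_score < prev_score)

def epsilon_at(frame, start=1.0, end=0.1, decay_frames=1_000_000):
    """Linearly anneal the exploration rate, then hold it at `end`."""
    fraction = min(frame / decay_frames, 1.0)
    return start + fraction * (end - start)
```

A jump of 250 points and a jump of 1 point both produce a reward of +1, which is what lets one step-size parameter serve games with very different score ranges.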

After DQN selected an action, the action was executed by the game emulator, which returned a 
reward and the next video frame. The frame was preprocessed and added to the four-frame stack that 
became the next input to the network. Skipping for the moment the changes to the basic Q-learning 
procedure made by Mnih et al., DQN used the following semi-gradient form of Q-learning to update 
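the network’s weights: 

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \Big[ R_{t+1} + \gamma \max_a \hat{q}(S_{t+1}, a, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \Big] \nabla_{\mathbf{w}_t} \hat{q}(S_t, A_t, \mathbf{w}_t), \tag{16.3}$$

where $\mathbf{w}_t$ is the vector of the network’s weights, $A_t$ is the action selected at time step $t$, and $S_t$ and 
$S_{t+1}$ are respectively the preprocessed image stacks input to the network at time steps $t$ and $t + 1$. 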
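
The semi-gradient Q-learning update is easiest to see with a linear approximator in place of the deep network. In the sketch below each action has its own weight vector, so the gradient with respect to the selected action's weights is just the feature vector; the linear form, the feature vectors, and the values of `alpha` and `gamma` are assumptions made for illustration.

```python
def q_hat(w, x, a):
    """Linear action value: dot product of action a's weights with features x."""
    return sum(wi * xi for wi, xi in zip(w[a], x))

def q_learning_update(w, x, a, r, x_next, alpha=0.1, gamma=0.99):
    """Apply one semi-gradient Q-learning update in place and return the error."""
    target = r + gamma * max(q_hat(w, x_next, b) for b in range(len(w)))
    delta = target - q_hat(w, x, a)
    # only the weights of the selected action receive a gradient
    w[a] = [wi + alpha * delta * xi for wi, xi in zip(w[a], x)]
    return delta
```

The method is "semi-gradient" because the target is treated as a constant: no gradient is taken through the maximum over next-state action values.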

The gradient in (16.3) was computed by backpropagation. Imagining again that there was a separate 
network for each action, for the update at time step $t$, backpropagation was applied only to the network 
corresponding to $A_t$. Mnih et al. took advantage of techniques shown to improve the basic 
backpropagation algorithm when applied to large networks. They used a mini-batch method that updated weights 
only after accumulating gradient information over a small batch of images (here after 32 images). This 
yielded smoother sample gradients compared to the usual procedure that updates weights after each 
action. They also used a gradient-ascent algorithm called RMSProp (Tieleman and Hinton, 2012) that 
accelerates learning by adjusting the step-size parameter for each weight based on a running average of 
the magnitudes of recent gradients for that weight. 
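
The two techniques can be sketched as follows. This version is written as descent on a loss and keeps a running average of squared gradients, which is the usual statement of RMSProp; the step size, decay rate, and small constant `eps` are illustrative values and are not those used by Mnih et al.

```python
import math

def minibatch_gradient(per_sample_grads):
    """Average the per-sample gradients of a mini-batch, component by component."""
    n = len(per_sample_grads)
    return [sum(column) / n for column in zip(*per_sample_grads)]

def rmsprop_step(w, grads, avg_sq, step_size=0.01, decay=0.9, eps=1e-8):
    """Update weights in place, scaling each step by that weight's recent gradient size."""
    for i, g in enumerate(grads):
        avg_sq[i] = decay * avg_sq[i] + (1 - decay) * g * g
        w[i] -= step_size * g / (math.sqrt(avg_sq[i]) + eps)
```

Dividing by the root of the running average gives weights with consistently large gradients a smaller effective step size, and weights with small gradients a larger one.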

Mnih et al. modified the basic Q-learning procedure in three ways. First, they used a method called 
experience replay first studied by Lin (1992). This method stores the agent’s experience at each time 
step in a replay memory that is accessed to perform the weight updates. It worked like this in DQN. 
After the game emulator executed action $A_t$ in a state represented by the image stack $S_t$, and returned 
reward $R_{t+1}$ and image stack $S_{t+1}$, it added the tuple $(S_t, A_t, R_{t+1}, S_{t+1})$ to the replay memory. This 
memory accumulated experiences over many plays of the same game. At each time step multiple 
Q-learning updates (a mini-batch) were performed based on experiences sampled uniformly at random 
from the replay memory. Instead of $S_{t+1}$ becoming the new $S_t$ for the next update as it would in the 
usual form of Q-learning, a new unconnected experience was drawn from the replay memory to supply 
data for the next update. Since Q-learning is an off-policy algorithm, it does not need to be applied 
along connected trajectories. 
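
A replay memory needs little more than a bounded buffer and a uniform sampler. In the sketch below the capacity, the fixed seed, and sampling with replacement are simplifying assumptions; the batch size of 32 matches the mini-batch size mentioned earlier.

```python
import random
from collections import deque

class ReplayMemory:
    """Bounded store of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest experiences drop out first
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=32):
        """Draw experiences uniformly at random (with replacement here)."""
        return [self.rng.choice(self.buffer) for _ in range(batch_size)]
```

Each sampled tuple supplies everything a Q-learning update needs, so consecutive updates can come from unrelated parts of the agent's history.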

Q-learning with experience replay provided several advantages over the usual form of Q-learning. 
The ability to use each stored experience for many updates allowed DQN to learn more efficiently from 
its experiences. Experience replay reduced the variance of the updates because successive updates were 
not correlated with one another as they would be with standard Q-learning. And by removing the 
dependence of successive experiences on the current weights, experience replay eliminated one source 
of instability. 

Mnih et al. modified standard Q-learning in a second way to improve its stability. As in other methods 
that bootstrap, the target for a Q-learning update depends on the current action-value function estimate.
When a parameterized function approximation method is used to represent action values, the target
is a function of the same parameters that are being updated. For example, the target in the update
given by (16.3) is $\gamma \max_a \hat{q}(S_{t+1}, a, \mathbf{w}_t)$. Its dependence on $\mathbf{w}_t$ complicates the process compared to
the simpler supervised-learning situation in which the targets do not depend on the parameters being 
updated. As discussed in Chapter 11 this can lead to oscillations and/or divergence. 

To address this problem Mnih et al. used a technique that brought Q-learning closer to the simpler 
supervised-learning case while still allowing it to bootstrap. Whenever a certain number, C , of updates 
had been done to the weights w of the action-value network, they inserted the network’s current weights 
into another network and held these duplicate weights fixed for the next C updates of w. The outputs 
of this duplicate network over the next $C$ updates of $\mathbf{w}$ were used as the Q-learning targets. Letting $\tilde{q}$
denote the output of this duplicate network, then instead of (16.3) the update rule was:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\Big[R_{t+1} + \gamma \max_a \tilde{q}(S_{t+1}, a, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)\Big]\nabla \hat{q}(S_t, A_t, \mathbf{w}_t).$$


A final modification of standard Q-learning was also found to improve stability. They clipped the error 
term $R_{t+1} + \gamma \max_a \tilde{q}(S_{t+1}, a, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)$ so that it remained in the interval $[-1, 1]$.
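
The duplicate-network and error-clipping modifications can be illustrated together. The sketch below substitutes a linear action-value function for DQN's deep network so that the gradient is simply the feature vector; the function names, the feature vectors, and the values of `alpha`, `gamma`, and `C` are hypothetical, and handling of terminal states is omitted.

```python
def q_value(weights, features, action):
    """Linear action value: the action's weight row dotted with the features."""
    return sum(w * x for w, x in zip(weights[action], features))

def clipped_update(w, w_target, transition, alpha=0.1, gamma=0.99):
    """One semi-gradient Q-learning update using frozen target weights."""
    x, a, r, x_next = transition
    # The bootstrapped target is computed from the duplicate (frozen) weights.
    target = r + gamma * max(q_value(w_target, x_next, b)
                             for b in range(len(w_target)))
    error = target - q_value(w, x, a)
    error = max(-1.0, min(1.0, error))  # clip the error term to [-1, 1]
    # For a linear q, the gradient with respect to w[a] is the feature vector.
    w[a] = [wi + alpha * error * xi for wi, xi in zip(w[a], x)]
    return error

C = 3  # copy the weights into the duplicate network every C updates
w = [[0.0, 0.0], [0.0, 0.0]]
w_target = [row[:] for row in w]
transition = ([1.0, 0.0], 0, 5.0, [0.0, 1.0])  # (features, action, reward, next features)
errors = []
for step in range(1, 7):
    errors.append(clipped_update(w, w_target, transition))
    if step % C == 0:
        w_target = [row[:] for row in w]
```

Between refreshes the targets do not move as `w` changes, which is what brings the update closer to the supervised-learning case.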

Mnih et al. conducted a large number of learning runs on 5 of the games to gain insight into the 
effect that various of DQN’s design features had on its performance. They ran DQN with the four 
combinations of experience replay and the duplicate target network being included or not included. 
Although the results varied from game to game, each of these features alone significantly improved 
performance, and very dramatically improved performance when used together. Mnih et al. also studied 
the role played by the deep convolutional ANN in DQN’s learning ability by comparing the deep 
convolutional version of DQN with a version having a network of just one linear layer, both receiving 
the same stacked preprocessed video frames. Here, the improvement of the deep convolutional version 
over the linear version was particularly striking across all 5 of the test games. 

Creating artificial agents that excel over a diverse collection of challenging tasks has been an enduring 
goal of artificial intelligence. The promise of machine learning as a means for achieving this has been 
frustrated by the need to craft problem-specific representations. DeepMind’s DQN stands as a major 
step forward by demonstrating that a single agent can learn problem-specific features enabling it to 
acquire human-competitive skills over a range of tasks. But as Mnih et al. point out, DQN is not a 
complete solution to the problem of task-independent learning. Although the skills needed to excel on 
the Atari games were markedly diverse, all the games were played by observing video images, which 
made a deep convolutional ANN a natural choice for this collection of tasks. In addition, DQN’s 
performance on some of the Atari 2600 games fell considerably short of human skill levels on these 
games. The games most difficult for DQN—especially Montezuma’s Revenge on which DQN learned to 
perform about as well as the random player—require deep planning beyond what DQN was designed to 
do. Further, learning control skills through extensive practice, as DQN did in learning to play the Atari
games, is just one of the types of learning humans routinely accomplish. Despite these limitations, 
DQN advanced the state-of-the-art in machine learning by impressively demonstrating the promise of 
combining reinforcement learning with modern methods of deep learning. 


16.6 Mastering the Game of Go 

The ancient Chinese game of Go has challenged artificial intelligence researchers for many decades. 
Methods that achieve human-level skill, or even superhuman-level skill, in other games have not been 
successful in producing strong Go programs. Thanks to a very active community of Go programmers 
and international competitions, the level of Go program play has improved significantly over the years. 
Until recently, however, no Go program had been able to play anywhere near the level of a human Go 
master.

A team at DeepMind (Silver et al., 2016) developed the program AlphaGo that broke this barrier
by combining deep artificial neural networks (deep ANNs, Section 9.6), supervised learning, Monte 
Carlo tree search (MCTS, Section 8.11), and reinforcement learning. By the time of Silver et al.’s 
2016 publication, AlphaGo had been shown to be decisively stronger than other current Go programs, 
and it had defeated the European Go champion Fan Hui 5 games to 0. These were the first victories 
of a Go program over a professional human Go player without handicap in full Go games. Shortly 
thereafter, a similar version of AlphaGo won stunning victories over the 18-time world champion Lee
Sedol, winning 4 out of 5 games in a challenge match, making worldwide headline news. Artificial
intelligence researchers thought that it would be many more years, perhaps decades, before a program 
reached this level of play. 

Here we describe AlphaGo and a successor program called AlphaGo Zero (Silver et al., 2017). Whereas
AlphaGo relied, in addition to reinforcement learning, on supervised learning from a large database of
expert human moves, AlphaGo Zero used only reinforcement learning and no human data or guidance
beyond the basic rules of the game (hence the Zero in its name). We first describe AlphaGo in some
detail in order to highlight the relative simplicity of AlphaGo Zero, which is both higher-performing
and more of a pure reinforcement learning program. 

In many ways, both AlphaGo and AlphaGo Zero are descendants of Tesauro’s TD-Gammon
(Section 16.1), itself a descendant of Samuel’s checkers player (Section 16.2). All these programs included
reinforcement learning over simulated games of self-play. AlphaGo and AlphaGo Zero also built upon 
the progress made by DeepMind on playing Atari games with the program DQN (Section 16.5) that 
used deep convolutional ANNs to approximate optimal value functions. 

Go is a game between two players who alternately place black and white ‘stones’ on unoccupied
intersections, or ‘points,’ on a board with a grid of 19 horizontal and 19 vertical lines to produce
positions like the board configuration shown in the accompanying figure. The game’s goal is to capture
an area of the board larger than that captured by the opponent. Stones are captured according to simple
rules. A player’s stones are captured if they are completely surrounded by the other player’s stones,
meaning that there is no horizontally or vertically adjacent point that is unoccupied. For example,
Figure 16.6 shows on the left three white stones with an unoccupied adjacent point (labeled X). If
player black places a stone on X, the three white stones are captured and taken off the board
(Figure 16.6 middle). However, if player white were to place a stone on point X first, then the
possibility of this capture would be blocked (Figure 16.6 right). Other rules are needed to prevent
infinite capturing/recapturing loops. The game ends when neither player wishes to place another stone.
These rules are simple, but they produce a very complex game that has had wide appeal for thousands
of years.



A Go board configuration 



Figure 16.6: Go capturing rule. Left: the three white stones are not surrounded because point X is unoccupied. 
Middle: if black places a stone on X, the three white stones are captured and removed from the board. Right: 
if white places a stone on point X first, the capture is blocked.

Methods that produce strong play for other games, such as chess, have not worked as well for Go.
The search space for Go is significantly larger than that of chess because Go has a larger number of
legal moves per position than chess (≈ 250 versus ≈ 35) and Go games tend to involve more moves
than chess games (≈ 150 versus ≈ 80). But the size of the search space is not the major factor that
makes Go so difficult. Exhaustive search is infeasible for both chess and Go, and Go on smaller boards, 
e.g., 9 × 9, has proven to be exceedingly difficult as well. Experts agree that the major stumbling
block to creating stronger-than-amateur Go programs is the difficulty of defining an adequate position
evaluation function. A good evaluation function allows search to be truncated at a feasible depth by
providing relatively easy-to-compute predictions of what deeper search would likely yield. According
to Müller (2002): “No simple yet reasonable evaluation function will ever be found for Go.” A major
step forward was the introduction of MCTS to Go programs. The strongest programs at the time of
AlphaGo’s development all included MCTS, but master-level skill remained elusive.
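
To give a rough sense of the search-space comparison, the approximate branching factors and game lengths quoted above can be combined into the usual back-of-the-envelope estimate of game-tree size, $b^d$. This is only an order-of-magnitude illustration; the function name is ours, and the estimate ignores transpositions and variation in branching factor over a game.

```python
import math

def log10_game_tree_size(branching_factor, game_length):
    """Base-10 logarithm of branching_factor ** game_length."""
    return game_length * math.log10(branching_factor)

go_exponent = log10_game_tree_size(250, 150)    # Go: about 250 moves, about 150 plies
chess_exponent = log10_game_tree_size(35, 80)   # chess: about 35 moves, about 80 plies
print(f"Go: about 10^{go_exponent:.0f}; chess: about 10^{chess_exponent:.0f}")
```

Both numbers are far beyond exhaustive search, which is consistent with the point that size alone does not explain why Go is harder.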

Recall from Section 8.11 that MCTS is a decision-time planning procedure that does not attempt 
to learn and store a global evaluation function. Like a rollout algorithm (Section 8.10), it runs many 
Monte Carlo simulations of entire episodes (here, entire Go games) to select each action (here, each 
Go move: where to place a stone or to resign). Unlike a simple rollout algorithm, however, MCTS is 
an iterative procedure that incrementally extends a search tree whose root node represents the current 
environment state. As illustrated in Figure 8.11, each iteration traverses the tree by simulating actions 
guided by statistics associated with the tree’s edges. In its basic version, when a simulation reaches a 
leaf node of the search tree, MCTS expands the tree by adding some, or all, of the leaf node’s children 
to the tree. From the leaf node, or one of its newly added child nodes, a rollout is executed: a simulation
that typically proceeds all the way to a terminal state, with actions selected by a rollout policy. When 
the rollout completes, the statistics associated with the search tree’s edges that were traversed in this 
iteration are updated by backing up the return produced by the rollout. MCTS continues this process, 
starting each time at the search tree’s root at the current state, for as many iterations as possible given 
the time constraints. Then, finally, an action from the root node (which still represents the current 
environment state) is selected according to statistics accumulated in the root node’s outgoing edges. 
This is the action the agent takes. After the environment transitions to its next state, MCTS is executed 
again with the root node set to represent the new current state. The search tree at the start of this 
next execution might be just this new root node, or it might include descendants of this node left over 
from MCTS’s previous execution. The remainder of the tree is discarded. 
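
The four steps of basic MCTS (selection, expansion, rollout, backup) can be sketched on a toy game. The example below is a minimal illustration under several assumptions of our own: the game is a simple take-away game (remove 1 to 3 stones, whoever takes the last stone wins) rather than Go, the tree policy is a UCB-style rule with an arbitrary constant `c`, one child is added per iteration, the rollout policy is uniformly random, and the final move is the most-visited root edge.

```python
import math
import random

class Node:
    def __init__(self, pile, player):
        self.pile = pile        # stones remaining
        self.player = player    # player to move at this node (0 or 1)
        self.children = {}      # action -> child Node
        self.visits = 0
        self.wins = 0.0         # wins for the player who moved INTO this node

def legal_actions(pile):
    return [k for k in (1, 2, 3) if k <= pile]

def rollout(pile, player, rng):
    """Play uniformly random moves to the end; return the winner."""
    while True:
        pile -= rng.choice(legal_actions(pile))
        if pile == 0:
            return player       # this player took the last stone
        player = 1 - player

def mcts(root_pile, iterations, rng, c=1.4):
    root = Node(root_pile, player=0)
    for _ in range(iterations):
        node, path = root, [root]
        # Selection: descend while the node is nonterminal and fully expanded.
        while node.pile > 0 and len(node.children) == len(legal_actions(node.pile)):
            parent = node
            node = max(parent.children.values(),
                       key=lambda ch: ch.wins / ch.visits
                       + c * math.sqrt(math.log(parent.visits) / ch.visits))
            path.append(node)
        # Expansion: add one untried child of a nonterminal leaf.
        if node.pile > 0:
            untried = [a for a in legal_actions(node.pile) if a not in node.children]
            action = rng.choice(untried)
            node.children[action] = Node(node.pile - action, 1 - node.player)
            node = node.children[action]
            path.append(node)
        # Rollout from the new node (or read off the result if it is terminal).
        if node.pile == 0:
            winner = 1 - node.player
        else:
            winner = rollout(node.pile, node.player, rng)
        # Backup: update statistics along the traversed path.
        for n in path:
            n.visits += 1
            if winner != n.player:
                n.wins += 1.0
    best = max(root.children, key=lambda a: root.children[a].visits)
    return best, root

best_action, root = mcts(root_pile=5, iterations=200, rng=random.Random(0))
```

After the move is made, a new call would start from the resulting position, optionally reusing the corresponding subtree as described above.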


16.6.1 AlphaGo 

The main innovation that made AlphaGo such a strong player is that it selected moves by a novel version 
of MCTS that was guided by both a policy and a value function learned by reinforcement learning with 
function approximation provided by deep convolutional ANNs. Another key feature is that instead of 
reinforcement learning starting from random network weights, it started from weights that were the 
result of previous supervised learning from a large collection of human expert moves. 

The DeepMind team called AlphaGo’s modification of basic MCTS “asynchronous policy and value
MCTS,” or APV-MCTS. It selected actions via basic MCTS as described above but with some twists
in how it extended its search tree and how it evaluated action edges. In contrast to basic MCTS,
which expands its current search tree by using stored action values to select an unexplored edge from a
leaf node, APV-MCTS, as implemented in AlphaGo, expanded its tree by choosing an edge according
to probabilities supplied by a 13-layer deep convolutional ANN, called the SL-policy network, trained
previously by supervised learning to predict moves contained in a database of nearly 30 million human 
expert moves. 

Then, also in contrast to basic MCTS, which evaluates the newly-added state node solely by the 
return of a rollout initiated from it, APV-MCTS evaluated the node in two ways: by this return of the 
rollout, but also by a value function, $v_\theta$, learned previously by a reinforcement learning method. If $s$
was the newly-added node, its value became

$$v(s) = (1 - \eta)\,v_\theta(s) + \eta G, \tag{16.4}$$

where $G$ was the return of the rollout and $\eta$ controlled the mixing of the values resulting from these two
evaluation methods. In AlphaGo, the values $v_\theta(s)$ were supplied by the value network, another 13-layer deep
convolutional ANN that was trained as we describe below to output estimated values of board positions. 
APV-MCTS’s rollouts in AlphaGo were simulated games with both players using a fast rollout policy 
provided by a simple linear network, also trained by supervised learning before play. Throughout its 
execution, APV-MCTS kept track of how many simulations passed through each edge of the search 
tree, and when its execution completed, the most-visited edge from the root node was selected as the 
action to take, here the move AlphaGo actually made in a game. 
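
As a small worked illustration of (16.4) and of the final move choice, the sketch below mixes a value-network estimate with a rollout return and then picks the most-visited root edge. The function names, the example numbers, and the move labels are hypothetical.

```python
def mixed_leaf_value(value_estimate, rollout_return, eta):
    """Equation (16.4): blend the value-network estimate with the rollout return."""
    return (1.0 - eta) * value_estimate + eta * rollout_return

def most_visited_action(visit_counts):
    """Select the root edge that the simulations passed through most often."""
    return max(visit_counts, key=visit_counts.get)

# A leaf whose value estimate is 0.2 and whose rollout was won (return +1).
leaf_value = mixed_leaf_value(value_estimate=0.2, rollout_return=1.0, eta=0.5)

# Hypothetical simulation counts on three root edges.
move = most_visited_action({"D4": 120, "Q16": 310, "K10": 45})
```

Setting `eta` to 0 or 1 recovers evaluation by the value estimate alone or by the rollout alone.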

The value network had the same structure as the deep convolutional SL policy network except that 
it had a single output unit that gave estimated values of game positions instead of the SL policy 
network’s probability distributions over legal actions. Ideally, the value network would output optimal 
state values, and it might have been possible to approximate the optimal value function along the lines 
of TD-Gammon described above: self-play with nonlinear TD(A) coupled to a deep convolutional ANN. 
But the DeepMind team took a different approach that held more promise for a game as complex as 
Go. They divided the process of training the value network into two stages. In the first stage, they 
created the best policy they could by using reinforcement learning to train an RL policy network. This 
was a deep convolutional ANN with the same structure as the SL policy network. It was initialized 
with the final weights of the SL policy network that were learned via supervised learning, and then 
policy-gradient reinforcement learning was used to improve upon the SL policy. In the second stage of 
training the value network, the team used Monte Carlo policy evaluation on data obtained from a large 
number of simulated self-play games with moves selected by the RL policy network. 

Figure 16.7 illustrates the networks used by AlphaGo and the steps taken to train them in what the 
DeepMind team called the “AlphaGo pipeline.” All these networks were trained before any live game
play took place, and their weights remained fixed throughout live play. 

[Figure 16.7 appears here; panel labels recovered from the figure: Rollout policy, SL policy network, RL policy network.]

Figure 16.7: AlphaGo pipeline. Adapted with permission from Macmillan Publishers Ltd: Nature, vol.
529(7587), p. 485, copyright (2016).

Here is some more detail about AlphaGo's ANNs and their training. The identically-structured SL
and RL policy networks were similar to DQN’s deep convolutional network described in Section 16.5
for playing Atari games, except that they had 13 convolutional layers with the final layer consisting
of a softmax unit for each point on the 19 × 19 Go board. The networks’ input was a 19 × 19 × 48
image stack in which each point on the Go board was represented by the values of 48 binary or
integer-valued features. For example, for each point, one feature indicated if the point was occupied 
by one of AlphaGo's stones, one of its opponent’s stones, or was unoccupied, thus providing the “raw” 
representation of the board configuration. Other features were based on the rules of Go, such as the 
number of adjacent points that were empty, the number of opponent stones that would be captured by 
placing a stone there, the number of turns since a stone was placed there, and other features that the 
design team considered to be important. 

Training the SL policy network took approximately 3 weeks using a distributed implementation 
of stochastic gradient ascent on 50 processors. The network achieved 57% accuracy, where the best 
accuracy achieved by other groups at the time of publication was 44.4%. Training the RL policy 
network was done by policy gradient reinforcement learning over simulated games between the RL 
policy network’s current policy and opponents using policies randomly selected from policies produced by 
earlier iterations of the learning algorithm. Playing against a randomly selected collection of opponents 
prevented overfitting to the current policy. The reward signal was +1 if the current policy won, −1 if
it lost, and zero otherwise. These games directly pitted the two policies against one another without 
involving MCTS. By simulating many games in parallel on 50 processors, the DeepMind team trained 
the RL policy network on a million games in a single day. In testing the final RL policy, they found 
that it won more than 80% of games played against the SL policy, and it won 85% of games played 
against a Go program using MCTS that simulated 100,000 games per move. 

The value network, whose structure was similar to that of the SL and RL policy networks except for 
its single output unit, received the same input as the SL and RL policy networks with the exception that 
there was an additional binary feature giving the current color to play. Monte Carlo policy evaluation 
was used to train the network from data obtained from a large number of self-play games played using 
the RL policy. To avoid overfitting and instability due to the strong correlations between positions 
encountered in self-play, the DeepMind team constructed a data set of 30 million positions each chosen 
randomly from a unique self-play game. Then training was done using 50 million mini-batches each of 
32 positions drawn from this data set. Training took one week on 50 GPUs. 
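
The decorrelation step described here, taking a single position from each self-play game and then drawing mini-batches from the resulting data set, can be sketched as follows. The function names and the toy game records are our own; in particular, each game is represented simply as a list of position labels with an outcome, and we ignore the question of whose perspective the outcome is expressed from.

```python
import random

def one_position_per_game(games, rng):
    """Build a data set with exactly one (position, outcome) pair per game."""
    return [(rng.choice(positions), outcome) for positions, outcome in games]

def mini_batches(dataset, batch_size, num_batches, rng):
    """Yield mini-batches drawn at random from the data set."""
    for _ in range(num_batches):
        yield rng.sample(dataset, batch_size)

rng = random.Random(0)
# 100 hypothetical self-play games of 10 positions each, with outcome +1 or -1.
games = [([f"g{g}p{t}" for t in range(10)], 1 if g % 2 == 0 else -1)
         for g in range(100)]
dataset = one_position_per_game(games, rng)
batches = list(mini_batches(dataset, batch_size=32, num_batches=5, rng=rng))
```

Because no two examples come from the same game, the strong correlation between successive positions of a game cannot enter a mini-batch.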

The rollout policy was learned prior to play by a simple linear network trained by supervised
learning from a corpus of 8 million human moves. The rollout policy network had to output actions 
quickly while still being reasonably accurate. In principle, the SL or RL policy networks could have 
been used in the rollouts, but the forward propagation through these deep networks took too much time 
for either of them to be used in rollout simulations, a great many of which had to be carried out for 
each move decision during live play. For this reason, the rollout policy network was less complex than 
the other policy networks, and its input features could be computed more quickly than the features 
used for the policy networks. The rollout policy network allowed approximately 1,000 complete game 
simulations per second to be run on each of the processing threads that AlphaGo used. 

One may wonder why the SL policy was used instead of the better RL policy to select actions in 
the expansion phase of APV-MCTS. These policies took the same amount of time to compute since 
they used the same network architecture. The team actually found that AlphaGo played better against 
human opponents when APV-MCTS used the SL policy instead of the RL policy. They conjectured
that the reason for this was that the latter was tuned to respond to optimal moves rather than to the 
broader set of moves characteristic of human play. Interestingly, the situation was reversed for the value 
function used by APV-MCTS. They found that when APV-MCTS used the value function derived from 
the RL policy, it performed better than if it used the value function derived from the SL policy. 

Several methods worked together to produce AlphaGo's impressive playing skill. The DeepMind
team evaluated different versions of AlphaGo in order to assess the contributions made by these various
components. The parameter $\eta$ in (16.4) controlled the mixing of game state evaluations produced by
the value network and by rollouts. With $\eta = 0$, AlphaGo used just the value network without rollouts,
and with $\eta = 1$, evaluation relied just on rollouts. They found that AlphaGo using just the value
network played better than the rollout-only AlphaGo, and in fact played better than the strongest of
all other Go programs existing at the time. The best play resulted from setting $\eta = 0.5$, indicating
that combining the value network with rollouts was particularly important to AlphaGo's success. These
evaluation methods complemented one another: the value network evaluated the high-performance RL 
policy that was too slow to be used in live play, while rollouts using the weaker but much faster rollout 
policy were able to add precision to the value network’s evaluations for specific states that occurred 
during games. 

Overall, AlphaGo's remarkable success fueled a new round of enthusiasm for the promise of artificial 
intelligence, specifically for systems combining reinforcement learning with deep ANNs, to address 
problems in other challenging domains. 


16.6.2 AlphaGo Zero 

Building upon the experience with AlphaGo, a DeepMind team developed AlphaGo Zero (Silver et al.,
2017). In contrast to AlphaGo, this program used no human data or guidance beyond the basic rules
of the game (hence the Zero in its name). It learned exclusively from self-play reinforcement learning, 
with input giving just “raw” descriptions of the placements of stones on the Go board. AlphaGo 
Zero implemented a form of policy iteration (Section 4.3), interleaving policy evaluation with policy 
improvement. Figure 16.8 is an overview of AlphaGo Zero's algorithm. A significant difference between 
AlphaGo Zero and AlphaGo is that AlphaGo Zero used MCTS to select moves throughout self-play
reinforcement learning, whereas AlphaGo used MCTS for live play after, but not during, learning.
Other differences besides not using any human data or human-crafted features are that AlphaGo Zero 
used only one deep convolutional ANN and used a simpler version of MCTS. 

AlphaGo Zero’s MCTS was simpler than the version used by AlphaGo in that it did not include
rollouts of complete games, and therefore did not need a rollout policy. Each iteration of AlphaGo
Zero’s MCTS ran a simulation that ended at a leaf node of the current search tree instead of at the
terminal position of a complete game simulation. But as in AlphaGo, each iteration of MCTS in
AlphaGo Zero was guided by the output of a deep convolutional network, labeled $f_\theta$ in Figure 16.8,
where $\theta$ is the network’s weight vector. The input to the network, whose architecture we describe below,
consisted of raw representations of board positions, and its output had two parts: a scalar value, $v$, an
estimate of the probability that the current player will win from the current board position, and
a vector, $\mathbf{p}$, of move probabilities, one for each possible stone placement on the current board, plus the
pass, or resign, move.

Instead of selecting self-play actions according to the probabilities $\mathbf{p}$, however, AlphaGo Zero used
these probabilities, together with the network’s value output, to direct each execution of MCTS, which
returned new move probabilities, shown in Figure 16.8 as the policies $\pi_i$. These policies benefitted from
the many simulations that MCTS conducted each time it executed. The result was that the policy
actually followed by AlphaGo Zero was an improvement over the policy given by the network’s outputs
$\mathbf{p}$. Silver et al. (2017) wrote that “MCTS may therefore be viewed as a powerful policy improvement
operator.”

Here is more detail about AlphaGo Zero’s ANN and how it was trained. The network took as input 
a 19 × 19 × 17 image stack consisting of 17 binary feature planes. The first 8 feature planes were
raw representations of the positions of the current player’s stones in the current and seven past board 
configurations: a feature value was 1 if a player’s stone was on the corresponding point, and was 0 
otherwise. The next 8 feature planes similarly coded the positions of the opponent’s stones. A final 
input feature plane had a constant value indicating the color of the current play: 1 for black; 0 for 
white. Because repetition is not allowed in Go and one player is given some number of “compensation
points” for not getting the first move, the current board position is not a Markov state of Go. This is
why features describing past board positions and the color feature were needed.

Figure 16.8: AlphaGo Zero self-play reinforcement learning. a) The program played many games against itself,
one shown here as a sequence of board positions $s_i$, $i = 1, 2, \ldots, T$, with moves $a_i$, $i = 1, 2, \ldots, T$, and winner
$z$. Each move $a_i$ was determined by action probabilities $\pi_i$ returned by MCTS executed from root node $s_i$ and
guided by a deep convolutional network, here labeled $f_\theta$, with latest weights $\theta$. Shown here for just one position
$s$ but repeated for all $s_i$, the network’s inputs were raw representations of board positions $s_i$ (together with
several past positions, though not shown here), and its outputs were vectors $\mathbf{p}$ of move probabilities that guided
MCTS’s forward searches, and scalar values $v$ that estimated the probability of the current player winning from
each position $s_i$. b) Deep convolutional network training. Training examples were randomly sampled steps from
recent self-play games. Weights $\theta$ were updated to move the policy vector $\mathbf{p}$ toward the probabilities $\pi$ returned
by MCTS, and to include the winners $z$ in the estimated win probability $v$. Reprinted from draft of Silver et
al. (2017) with permission of the authors and DeepMind.
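
The 17-plane input just described can be made concrete with a short encoding sketch. It follows the plane ordering given in the text (8 planes for the current player, 8 for the opponent, 1 for color); the function names, the board representation with `'B'`, `'W'`, and `None`, and the padding of missing early-game history with empty boards are assumptions of ours.

```python
SIZE = 19
HISTORY = 8  # current position plus seven past positions

def stone_plane(board, color):
    """Binary plane: 1 where `color` has a stone, 0 elsewhere."""
    return [[1 if board[r][c] == color else 0 for c in range(SIZE)]
            for r in range(SIZE)]

def encode(history, to_play):
    """Encode positions as 17 planes of size 19 x 19.

    history: list of boards, most recent first; each board is a 19 x 19 grid
             holding 'B', 'W', or None.
    to_play: 'B' or 'W', the color of the player about to move.
    """
    empty = [[None] * SIZE for _ in range(SIZE)]
    # Assumption: pad with empty boards when fewer than 8 positions exist.
    boards = (history + [empty] * HISTORY)[:HISTORY]
    opponent = 'W' if to_play == 'B' else 'B'
    planes = [stone_plane(b, to_play) for b in boards]      # planes 0-7
    planes += [stone_plane(b, opponent) for b in boards]    # planes 8-15
    fill = 1 if to_play == 'B' else 0                       # plane 16: color
    planes.append([[fill] * SIZE for _ in range(SIZE)])
    return planes

board = [[None] * SIZE for _ in range(SIZE)]
board[3][3], board[15][15] = 'B', 'W'
planes = encode([board], to_play='B')
```

Stacking the past positions and the color plane is what lets the network's input serve as an (approximately) Markov state.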

The network was “two-headed,” meaning that after a number of initial layers, the network split into 
two separate “heads” of additional layers that separately fed into two sets of output units. In this 
case, one head fed 362 output units producing $19^2 + 1$ move probabilities $\mathbf{p}$, one for each possible stone
placement plus pass; the other head fed just one output unit producing the scalar v, an estimate of 
the probability that the current player will win from the current board position. The network before 
the split consisted of 41 convolutional layers, each followed by batch normalization, and with skip 
connections added to implement residual learning by pairs of layers (see Section 9.6). Overall, move 
probabilities and values were computed by 43 and 44 layers respectively. 

Starting with random weights, the network was trained by stochastic gradient descent (with momentum,
regularization, and step-size parameter decreasing as training continues) using batches of examples
sampled uniformly at random from all the steps of the most recent 500,000 games of self-play with the
current best policy. Extra noise was added to the network’s output $\mathbf{p}$ to encourage exploration of all
possible moves. At periodic checkpoints during training, which Silver et al. (2017) chose to be at every
1,000 training steps, the policy output by the ANN with the latest weights was evaluated by simulating
400 games (using MCTS with 1,600 iterations to select each move) against the current best policy. If 
the new policy won (by a margin set to reduce noise in the outcome), then it became the best policy 
to be used in subsequent self-play. The network’s weights were updated to make the network’s policy 
output p more closely match the policy returned by MCTS, and to make its value output, v, more 
closely match the probability that the current best policy wins from the board position represented by 
the network’s input. 
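
One way to express this two-part training objective is as the sum of a squared error on the value output and a cross-entropy between the MCTS policy and the network's policy output. The sketch below takes that form as an assumption on our part, since the text above does not state the loss explicitly; the regularization term is omitted, and the function names and example numbers are hypothetical.

```python
import math

def softmax(logits):
    """Convert the policy head's raw outputs into move probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def two_headed_loss(p, v, pi, z):
    """Squared value error plus cross-entropy of p against the MCTS policy pi."""
    value_loss = (z - v) ** 2
    policy_loss = -sum(t * math.log(q) for t, q in zip(pi, p) if t > 0)
    return value_loss + policy_loss

# With 362 equal logits the policy head is uniform over 19*19 + 1 moves.
p_uniform = softmax([0.0] * 362)
loss_uniform = two_headed_loss(p_uniform, v=0.0, pi=p_uniform, z=0.0)

# Matching the search policy lowers the policy term (three-move toy example).
pi = [0.7, 0.2, 0.1]
loss_matched = two_headed_loss(pi, 0.0, pi, 0.0)
loss_unmatched = two_headed_loss([1 / 3] * 3, 0.0, pi, 0.0)
```

Minimizing the first term moves `v` toward the game outcome, and minimizing the second moves `p` toward the probabilities returned by MCTS.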

The DeepMind team trained AlphaGo Zero over 4.9 million games of self-play, which took about 3 
days. Each move of each game was selected by running MCTS for 1,600 iterations, taking approximately 
0.4 second per move. Network weights were updated over 700,000 batches each consisting of 2,048 board 
configurations. They then ran tournaments with the trained AlphaGo Zero playing against the version 
of AlphaGo that defeated Fan Hui by 5 games to 0, and against the version that defeated Lee Sedol by 
4 games to 1. They used the Elo rating system to evaluate the relative performances of the programs. 
The difference between two Elo ratings is meant to predict the outcome of games between the players. 
The Elo ratings of AlphaGo Zero, the version of AlphaGo that played against Fan Hui, and the version 
that played against Lee Sedol were respectively 4,308, 3,144, and 3,739. The gaps in these Elo ratings 
translate into predictions that AlphaGo Zero would defeat these other programs with probabilities very 
close to one. In a match of 100 games between AlphaGo Zero, trained as described, and the exact 
version of AlphaGo that defeated Lee Sedol held under the same conditions that were used in that 
match, AlphaGo Zero defeated AlphaGo in all 100 games. 
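
To see how rating gaps translate into predicted outcomes, the ratings quoted above can be plugged into the conventional logistic Elo formula with a 400-point scale. That Silver et al. used exactly this convention is an assumption on our part, and the function name is ours.

```python
def elo_expected_score(rating_a, rating_b):
    """Predicted score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

ZERO, VS_FAN_HUI, VS_LEE_SEDOL = 4308, 3144, 3739  # ratings quoted in the text

p_vs_fan_version = elo_expected_score(ZERO, VS_FAN_HUI)
p_vs_lee_version = elo_expected_score(ZERO, VS_LEE_SEDOL)
print(f"{p_vs_fan_version:.4f} {p_vs_lee_version:.4f}")
```

Under this model a gap of several hundred points already corresponds to a predicted score well above 0.9.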

The DeepMind team also compared AlphaGo Zero with a program using an ANN with the same 
architecture but trained by supervised learning to predict human moves in a data set containing nearly 
30 million positions from 160,000 games. They found that the supervised-learning player initially played 
better than AlphaGo Zero, and was better at predicting human expert moves, but played less well after 
AlphaGo Zero was trained for a day. This suggested that AlphaGo Zero had discovered a strategy for 
playing that was different from how humans play. In fact, AlphaGo Zero discovered, and came to prefer, 
some novel variations of classical move sequences. 

Final tests of AlphaGo Zero's algorithm were conducted with a version having a larger ANN and 
trained over 29 million self-play games, which took about 40 days, again starting with random weights. 
This version achieved an Elo rating of 5,185. The team pitted this version of AlphaGo Zero against 
a program called AlphaGo Master, the strongest program at the time, that was identical to AlphaGo 
Zero but, like AlphaGo, used human data and features. AlphaGo Master's Elo rating was 4,858, and 
it had defeated the strongest human professional players 60 to 0 in online games. In a 100-game
match, AlphaGo Zero with the larger network and more extensive learning defeated AlphaGo Master 
89 games to 11, thus providing a convincing demonstration of the problem-solving power of AlphaGo 
Zero's algorithm. 

AlphaGo Zero soundly demonstrated that superhuman performance can be achieved by pure
reinforcement learning, augmented by a simple version of MCTS, and deep ANNs with very minimal
knowledge of the domain and no reliance on human data or guidance. We will surely see systems
inspired by the DeepMind accomplishments of both AlphaGo and AlphaGo Zero applied to challenging
problems in other domains.


16.7 Personalized Web Services 

Personalizing web services such as the delivery of news articles or advertisements is one approach to
increasing users’ satisfaction with a website or increasing the yield of a marketing campaign. A policy
can recommend content considered to be the best for each particular user based on a profile of that 
user’s interests and preferences inferred from their history of online activity. This is a natural domain 
for machine learning, and in particular, for reinforcement learning. A reinforcement learning system 
can improve a recommendation policy by making adjustments in response to user feedback. One way 
to obtain user feedback is by means of website satisfaction surveys, but for acquiring feedback in real 
time it is common to monitor user clicks as indicators of interest in a link. 

A method long used in marketing called A/B testing is a simple type of reinforcement learning used to 
decide which of two versions, A or B, of a website users prefer. Because it is non-associative, like a two-armed 
bandit problem, this approach does not personalize content delivery. Adding context consisting 
of features describing individual users and the content to be delivered allows personalizing service. This 
has been formalized as a contextual bandit problem (or an associative reinforcement learning problem, 
Section 2.9) with the objective of maximizing the total number of user clicks. Li, Chu, Langford, and 
Schapire (2010) applied a contextual bandit algorithm to the problem of personalizing the Yahoo! Front 
Page Today webpage (one of the most visited pages on the internet at the time of their research) by 
selecting the news story to feature. Their objective was to maximize the click-through rate (CTR), 
which is the ratio of the total number of clicks all users make on a webpage to the total number of 
visits to the page. Their contextual bandit algorithm improved over a standard non-associative bandit 
algorithm by 12.5%. 

Theocharous, Thomas, and Ghavamzadeh (2015) argued that better results are possible by formulating 
personalized recommendation as a Markov decision problem (MDP) with the objective of 
maximizing the total number of clicks users make over repeated visits to a website. Policies derived 
from the contextual bandit formulation are greedy in the sense that they do not take long-term effects 
of actions into account. These policies effectively treat each visit to a website as if it were made by a 
new visitor uniformly sampled from the population of the website’s visitors. By not using the fact that 
many users repeatedly visit the same websites, greedy policies do not take advantage of possibilities 
provided by long-term interactions with individual users. 

As an example of how a marketing strategy might take advantage of long-term user interaction, 
Theocharous et al. contrasted a greedy policy with a longer-term policy for displaying ads for buying 
a product, say a car. The ad displayed by the greedy policy might offer a discount if the user buys the 
car immediately. A user either takes the offer or leaves the website, and if they ever return to the site, 
they would likely see the same offer. A longer-term policy, on the other hand, can transition the user 
“down a sales funnel” before presenting the final deal. It might start by describing the availability of 
favorable financing terms, then praise an excellent service department, and then, on the next visit, offer 
the final discount. This type of policy can result in more clicks by a user over repeated visits to the 
site, and if the policy is suitably designed, more eventual sales. 

Working at Adobe Systems Incorporated, Theocharous et al. conducted experiments to see if policies 
designed to maximize clicks over the long term could in fact improve over short-term greedy policies. 
The Adobe Marketing Cloud, a set of tools that many companies use to run digital marketing 
campaigns, provides infrastructure for automating user-targeted advertising and fund-raising campaigns. 
Actually deploying novel policies using these tools entails significant risk because a new policy may end 
up performing poorly. For this reason, the research team needed to assess what a policy’s performance 
would be if it were to be actually deployed, but to do so on the basis of data collected under the 
execution of other policies. A critical aspect of this research, then, was off-policy evaluation. Further, 
the team wanted to do this with high confidence to reduce the risk of deploying a new policy. Although 
high confidence off-policy evaluation was a central component of this research (see also Thomas, 2015; 
Thomas, Theocharous, and Ghavamzadeh, 2015), here we focus only on the algorithms and their results. 

Theocharous et al. compared the results of two algorithms for learning ad recommendation policies. 
The first algorithm, which they called greedy optimization, had the goal of maximizing only the probability 
of immediate clicks. As in the standard contextual bandit formulation, this algorithm did not take 
the long-term effects of recommendations into account. The other algorithm, a reinforcement learning 
algorithm based on an MDP formulation, aimed at improving the number of clicks users made over 
multiple visits to a website. They called this latter algorithm life-time value (LTV) optimization. Both 
algorithms faced challenging problems: the reward signal in this domain is very sparse because 
users usually do not click on ads, and user clicking is highly random, so returns have high variance. 

Data sets from the banking industry were used for training and testing these algorithms. The data 
sets consisted of many complete trajectories of customer interaction with a bank’s website that showed 
each customer one out of a collection of possible offers. If a customer clicked, the reward was 1, 
and otherwise it was 0. One data set contained approximately 200,000 interactions from a month of a 
bank’s campaign that randomly offered one of 7 offers. The other data set from another bank’s campaign 
contained 4,000,000 interactions involving 12 possible offers. All interactions included customer features 
such as the time since the customer’s last visit to the website, the number of their visits so far, the 
last time the customer clicked, geographic location, one of a collection of interests, and features giving 
demographic information. 

Greedy optimization was based on a mapping estimating the probability of a click as a function of 
user features. The mapping was learned via supervised learning from one of the data sets by means 
of a random forest (RF) algorithm (Breiman, 2001). RF algorithms have been widely used for large- 
scale applications in industry because they are effective predictive tools that tend not to overfit and 
are relatively insensitive to outliers and noise. Theocharous et al. then used the mapping to define 
an $\varepsilon$-greedy policy that selected with probability $1-\varepsilon$ the offer predicted by the RF algorithm to have 
the highest probability of producing a click, and otherwise selected from the other offers uniformly at 
random. 
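
The selection rule just described can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the list of predicted click probabilities (a stand-in for the output of the RF mapping), and the value of $\varepsilon$ are hypothetical and are not taken from Theocharous et al.

```python
import random

def epsilon_greedy_offer(click_probs, epsilon, rng=random):
    """Pick an offer index from predicted click probabilities.

    With probability 1 - epsilon return the offer with the highest predicted
    click probability; otherwise pick uniformly among the *other* offers.
    """
    best = max(range(len(click_probs)), key=lambda i: click_probs[i])
    if len(click_probs) == 1 or rng.random() >= epsilon:
        return best
    others = [i for i in range(len(click_probs)) if i != best]
    return rng.choice(others)

# Hypothetical predictions for three offers (stand-ins for the RF output).
predicted = [0.02, 0.05, 0.01]
rng = random.Random(0)
picks = [epsilon_greedy_offer(predicted, epsilon=0.1, rng=rng) for _ in range(1000)]
greedy_fraction = picks.count(1) / len(picks)  # should be close to 0.9
```

Note that exploration here excludes the greedy offer, as in the description above, so the greedy offer is chosen with probability exactly $1-\varepsilon$.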

LTV optimization used a batch-mode reinforcement learning algorithm called fitted Q iteration (FQI). 
It is a variant of fitted value iteration (Gordon, 1999) adapted to Q-learning. Batch mode means 
that the entire data set for learning is available from the start, as opposed to the on-line mode of 
the algorithms we focus on in this book in which data are acquired sequentially while the learning 
algorithm executes. Batch-mode reinforcement learning algorithms are sometimes necessary when on-line 
learning is not practical, and they can use any batch-mode supervised learning regression algorithm, 
including algorithms known to scale well to high-dimensional spaces. The convergence of FQI depends 
on properties of the function approximation algorithm (Gordon, 1999). For their application to LTV 
optimization, Theocharous et al. used the same RF algorithm they used for the greedy optimization 
approach. Since in this case FQI convergence is not monotonic, Theocharous et al. kept track of the 
best FQI policy by off-policy evaluation using a validation training set. The final policy for testing 
the LTV approach was the $\varepsilon$-greedy policy based on the best policy produced by FQI with the initial 
action-value function set to the mapping produced by the RF for the greedy optimization approach. 
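
To make the batch-mode idea concrete, here is a minimal sketch of fitted Q iteration. It is not the implementation used by Theocharous et al.: in place of a random forest we assume a trivial stand-in regressor that averages the targets for each state-action pair, and the two-visit data set, its state names, and its offers are invented for illustration.

```python
from collections import defaultdict

def fit_table(inputs, targets):
    """Stand-in regressor: average the targets seen for each (state, action)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for key, y in zip(inputs, targets):
        sums[key] += y
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

def fitted_q_iteration(batch, actions, gamma, n_iterations):
    """batch: list of (state, action, reward, next_state), with next_state
    set to None when the transition ends the trajectory."""
    q = {}  # pairs never seen in the batch are treated as having value 0
    inputs = [(s, a) for s, a, _, _ in batch]
    for _ in range(n_iterations):
        targets = []
        for s, a, r, s_next in batch:
            if s_next is None:
                targets.append(r)
            else:
                best_next = max(q.get((s_next, b), 0.0) for b in actions)
                targets.append(r + gamma * best_next)
        # Each iteration is one supervised regression on the whole batch.
        q = fit_table(inputs, targets)
    return q

# Hypothetical log: "tease" earns no click now but leads to a return visit
# on which "deal" is clicked; "deal" on the first visit ends the interaction.
batch = [
    ("first_visit", "tease", 0.0, "return_visit"),
    ("first_visit", "deal", 0.0, None),
    ("return_visit", "deal", 1.0, None),
    ("return_visit", "tease", 0.0, None),
]
q = fitted_q_iteration(batch, actions=["tease", "deal"], gamma=0.9, n_iterations=5)
```

In this toy batch the learned values favor "tease" on the first visit (0.9 versus 0), which is the kind of long-term preference a greedy click predictor cannot express.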

To measure the performance of the policies produced by the greedy and LTV approaches, Theocharous 
et al. used the CTR metric and a metric they called the LTV metric. These metrics are similar, except 
that the LTV metric critically distinguishes between individual website visitors: 

\[ \text{CTR} = \frac{\text{Total \# of Clicks}}{\text{Total \# of Visits}}, \]

\[ \text{LTV} = \frac{\text{Total \# of Clicks}}{\text{Total \# of Visitors}}. \]

Figure 16.9 illustrates how these metrics differ. Each circle represents a user visit to the site; black 
circles are visits at which the user clicks. Each row represents visits by a particular user. By not 
distinguishing between visitors, the CTR for these sequences is 0.35, whereas the LTV is 1.5. Because 
LTV is larger than CTR to the extent that individual users revisit the site, it is an indicator of how 
successful a policy is in encouraging users to engage in extended interactions with the site. 
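
The difference between the two metrics is easy to see in code. The sketch below uses invented visit sequences (they are not the sequences of Figure 16.9), and the function name and data layout are our own assumptions.

```python
def ctr_and_ltv(visits_by_user):
    """visits_by_user: one list per visitor, 1 for a visit with a click, else 0."""
    total_clicks = sum(sum(visits) for visits in visits_by_user)
    total_visits = sum(len(visits) for visits in visits_by_user)
    total_visitors = len(visits_by_user)
    return total_clicks / total_visits, total_clicks / total_visitors

# Hypothetical data: three visitors with four, two, and four visits.
visits_by_user = [
    [1, 0, 0, 1],
    [0, 0],
    [1, 0, 0, 0],
]
ctr, ltv = ctr_and_ltv(visits_by_user)
print(ctr, ltv)  # 0.3 1.0
```

Both metrics share the same numerator; only the denominator changes, so LTV exceeds CTR exactly when visitors make more than one visit on average.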

Testing the policies produced by the greedy and LTV approaches was done using a high confidence 
off-policy evaluation method on a test data set consisting of real-world interactions with a bank website 
served by a random policy. As expected, results showed that greedy optimization performed best as 
measured by the CTR metric, while LTV optimization performed best as measured by the LTV metric. 
Furthermore, although we have omitted its details, the high confidence off-policy evaluation method 
provided probabilistic guarantees that the LTV optimization method would, with high probability, produce 
policies that improve upon policies currently deployed. Assured by these probabilistic guarantees, 
Adobe announced in 2016 that the new LTV algorithm would be a standard component of the Adobe 
Marketing Cloud so that a retailer could issue a sequence of offers following a policy likely to yield 
higher return than a policy that is insensitive to long-term results. 

Figure 16.9: Click-through rate (CTR) versus life-time value (LTV). Each circle represents a user visit; black 
circles are visits at which the user clicks. Adapted from Theocharous et al. (2015). 


16.8 Thermal Soaring 

Birds and gliders take advantage of upward air currents—thermals—to gain altitude in order to maintain 
flight while expending little, or no, energy. Thermal soaring, as this behavior is called, is a complex skill 
that requires responding to subtle environmental cues to increase altitude by exploiting a rising column of 
air for as long as possible. Reddy, Celani, Sejnowski, and Vergassola (2016) used reinforcement learning 
to investigate thermal soaring policies that are effective in the strong atmospheric turbulence usually 
accompanying rising air currents. Their primary goal was to provide insight into the cues birds sense 
and how they use them to achieve their impressive thermal soaring performance, but the results also 
contribute to technology relevant to autonomous gliders. Reinforcement learning had previously been 
applied to the problem of navigating efficiently to the vicinity of a thermal updraft (Woodbury, Dunn, 
and Valasek, 2014) but not to the more challenging problem of soaring within the turbulence of the 
updraft itself. 

Reddy et al. modeled the soaring problem as an MDP. The agent interacted with a detailed model 
of a glider flying in turbulent air. They devoted significant effort toward making the model generate 
realistic thermal soaring conditions, including investigating several different approaches to atmospheric 
modeling. For the learning experiments, air flow in a three-dimensional box with one kilometer sides, 
one of which was at ground level, was modeled by a sophisticated physics-based set of partial differential 
equations involving air velocity, temperature, and pressure. Introducing small random perturbations 
into the numerical simulation caused the model to produce analogs of thermal updrafts and accompanying 
turbulence (Figure 16.10 Left). Glider flight was modeled by aerodynamic equations involving 
velocity, lift, drag, and other factors governing powerless flight of a fixed-wing aircraft. Maneuvering 
the glider involved changing its angle of attack (the angle between the glider’s wing and the direction 
of air flow) and its bank angle (Figure 16.10 Right). 

The interface between the agent and the environment required defining the agent’s actions, the state 
information the agent receives from the environment, and the reward signal. By experimenting with 
various possibilities, Reddy et al. decided that three actions each for the angle of attack and the bank 
angle were enough for their purposes: increment or decrement the current bank angle and angle of 
attack by 5° and 2.5°, respectively, or leave them unchanged. This resulted in $3^2 = 9$ possible actions. The 
bank angle was bounded to remain between −15° and +15°. 

Figure 16.10: Thermal soaring model: Left: snapshot of the vertical velocity field of the simulated cube of air: 
in light (dark) grey is a region of large upward (downward) flow. Right: diagram of powerless flight showing 
bank angle $\mu$ and angle of attack $\alpha$. Adapted with permission from PNAS vol. 113(22), p. E4879, 2016, Reddy, 
Celani, Sejnowski, and Vergassola, Learning to Soar in Turbulent Environments. 
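
The action set described above (three choices for each of the two angles, with a bounded bank angle) might be encoded as follows. The names are ours, and leaving the angle of attack unbounded is an assumption made only because the description gives a bound for the bank angle alone.

```python
from itertools import product

BANK_STEP_DEG = 5.0      # increment for the bank angle
ATTACK_STEP_DEG = 2.5    # increment for the angle of attack
BANK_LIMIT_DEG = 15.0    # bank angle is kept within [-15, +15] degrees

# Each action is a pair (bank change, angle-of-attack change) in degrees:
# decrement, leave unchanged, or increment each angle.
ACTIONS = [
    (d_bank * BANK_STEP_DEG, d_attack * ATTACK_STEP_DEG)
    for d_bank, d_attack in product((-1, 0, 1), repeat=2)
]

def apply_action(bank_deg, attack_deg, action):
    """Return the new (bank, attack) angles, clipping only the bank angle."""
    d_bank, d_attack = action
    new_bank = max(-BANK_LIMIT_DEG, min(BANK_LIMIT_DEG, bank_deg + d_bank))
    return new_bank, attack_deg + d_attack
```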

Because a goal of their study was to try to determine what minimal set of sensory cues is necessary 
for effective soaring, both to shed light on the cues birds might use for soaring and to minimize the 
sensing complexity required for automated glider soaring, the authors tried various sets of signals as 
input to the reinforcement learning agent. They started by using state aggregation (Chapter 9) of 
a four-dimensional state space with dimensions giving local vertical wind speed, local vertical wind 
acceleration, torque depending on the difference between the vertical wind velocities at the left and 
right wing tips, and the local temperature. Each dimension was discretized into three bins: positive 
high, negative high, and small. Results, described below, showed that only two of these dimensions 
were critical for effective soaring behavior. 
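
As an illustration of this kind of state aggregation, the sketch below maps four real-valued features to one of $3^4 = 81$ aggregate states. The thresholds separating "small" from "high" are hypothetical placeholders; the values used by Reddy et al. are not given here.

```python
def bin_index(value, threshold):
    """0 = negative high, 1 = small, 2 = positive high."""
    if value > threshold:
        return 2
    if value < -threshold:
        return 0
    return 1

def aggregate_state(features, thresholds):
    """Map a tuple of real-valued features to one index in range(3 ** n)."""
    index = 0
    for value, threshold in zip(features, thresholds):
        index = index * 3 + bin_index(value, threshold)
    return index

# Hypothetical thresholds for (vertical wind speed, vertical wind
# acceleration, torque, temperature), in arbitrary units.
thresholds = (0.5, 0.1, 0.2, 0.3)
state = aggregate_state((0.9, -0.4, 0.0, 0.05), thresholds)
```

Dropping a feature from the input simply shortens the tuples, which is one way the different sets of sensory cues could be compared with the same learning code.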

The overall objective of thermal soaring is to gain as much altitude as possible from each rising 
column of air. Reddy et al. tried a straightforward reward signal that rewarded the agent at the end of 
each episode based on the altitude gained over the episode, a large negative reward signal if the glider 
touched the ground, and zero otherwise. They found that learning was not successful with this reward 
signal for episodes of realistic duration and that eligibility traces did not help. By experimenting with 
various reward signals, they found that learning was best with a reward signal that at each time step 
linearly combined the vertical wind velocity and vertical wind acceleration observed on the previous 
time step. 

Learning was by Sarsa with action selection using softmax applied to action values normalized to 
the interval [0,1]. The temperature parameter was initialized to 2.0 and incrementally decreased to 0.2 
during learning. The step-size and discount-rate parameters were fixed at 0.1 and 0.98, respectively. Each 
learning episode took place with the agent controlling simulated flight in an independently generated 
period of simulated turbulent air currents. Each episode lasted 2.5 minutes of simulated time with a 1-second 
time step. Learning effectively converged after a few hundred episodes. The left panel of Figure 16.11 
shows a sample trajectory before learning where the agent selects actions randomly. Starting at the top 
of the volume shown, the glider’s trajectory is in the direction indicated by the arrow and quickly loses 
altitude. Figure 16.11’s right panel is a trajectory after learning. The glider starts at the same place 
(here appearing at the bottom of the volume) and gains altitude by spiraling within the rising column 
of air. Although Reddy et al. found that performance varied widely over different simulated periods 
of air flow, the number of times the glider touched the ground consistently decreased to nearly zero as 
learning progressed. 

Figure 16.11: Sample thermal soaring trajectories, with arrows showing the direction of flight from the same 
starting point (note that the altitude scales are shifted). Left: before learning: the agent selects actions randomly 
and the glider descends. Right: after learning: the glider gains altitude by following a spiral trajectory. Adapted 
with permission from PNAS vol. 113(22), p. E4879, 2016, Reddy, Celani, Sejnowski, and Vergassola, Learning 
to Soar in Turbulent Environments. 
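
The learning rule in this study combines tabular Sarsa with softmax action selection. A rough sketch follows. The step size (0.1), discount rate (0.98), and temperature range (2.0 down to 0.2) are the values quoted above; the linear annealing schedule, the table layout, and the convention that larger normalized values receive higher probability are our assumptions.

```python
import math
import random

def softmax_action(q_values, temperature, rng=random):
    """Sample an action index from a softmax over values rescaled to [0, 1]."""
    lo, hi = min(q_values), max(q_values)
    span = hi - lo
    scaled = [(q - lo) / span if span > 0 else 0.0 for q in q_values]
    weights = [math.exp(v / temperature) for v in scaled]
    return rng.choices(range(len(q_values)), weights=weights)[0]

def sarsa_update(q, s, a, reward, s_next, a_next, alpha=0.1, gamma=0.98):
    """One Sarsa step on a table q[state][action]; returns the TD error."""
    td_error = reward + gamma * q[s_next][a_next] - q[s][a]
    q[s][a] += alpha * td_error
    return td_error

def temperature_at(step, total_steps, start=2.0, end=0.2):
    """Linear annealing from start to end (the schedule shape is assumed)."""
    fraction = min(1.0, step / total_steps)
    return start + fraction * (end - start)
```

Rescaling the action values before the softmax keeps the temperature meaningful regardless of the magnitude of the rewards, which may be why normalization was used.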

After experimenting with different sets of features available to the learning agent, the authors found that 
the combination of just vertical wind acceleration and torques worked best. They conjectured 
that because these features give information about the gradient of vertical wind velocity in two different 
directions, they allow the controller to select between turning by changing the bank angle or continuing 
along the same course by leaving the bank angle alone. This allows the glider to stay within a rising 
column of air. Vertical wind velocity is indicative of the strength of the thermal but does not help in 
staying within the flow. They found that sensitivity to temperature was of little help. They also found 
that controlling the angle of attack is not helpful in staying within a particular thermal, being useful 
instead for traveling between thermals when covering large distances, as in cross-country gliding and 
bird migration. 

Since soaring in different levels of turbulence requires different policies, training was done in conditions 
ranging from weak to strong turbulence. In strong turbulence the rapidly changing wind and glider 
velocities allowed less time for the controller to react. This reduced the amount of control possible 
compared to what was possible for maneuvering when fluctuations were weak. Reddy et al. examined 
the policies Sarsa learned under these different conditions. Common to policies learned in all regimes 
were these features: when sensing negative wind acceleration, bank sharply in the direction of the wing 
with the higher lift; when sensing large positive wind acceleration and no torque, do nothing. However, 
different levels of turbulence led to policy differences. Policies learned in strong turbulence were more 
conservative in that they preferred small bank angles, whereas in weak turbulence, the best action was 
to turn as much as possible by banking sharply. Systematic study of the bank angles preferred by the 
policies learned under the different conditions led the authors to suggest that by detecting when vertical 
wind acceleration crosses a certain threshold the controller can adjust its policy to cope with different 
turbulence regimes. 

Reddy et al. also conducted experiments to investigate the effect of the discount-rate parameter $\gamma$ on 
the performance of the learned policies. They found that the altitude gained in an episode increased as 
$\gamma$ increased, reaching a maximum for $\gamma = 0.99$, suggesting that effective thermal soaring requires taking 
into account long-term effects of control decisions. 

This computational study of thermal soaring illustrates how reinforcement learning can further 
progress toward different kinds of objectives. Learning policies having access to different sets of environmental 
cues and control actions contributes to both the engineering objective of designing autonomous 
gliders and the scientific objective of improving understanding of the soaring skills of birds. In both 
cases, hypotheses resulting from the learning experiments can be tested in the field by instrumenting 
real gliders and by comparing predictions with observed bird soaring behavior. 



Chapter 17 


Frontiers 


In this final chapter we touch on some topics that are beyond the scope of this book but that we see as 
particularly important for the future of reinforcement learning. Many of these topics bring us beyond 
what is reliably known, and some bring us beyond the MDP framework. 


17.1 General Value Functions and Auxiliary Tasks 


Over the course of this book, our notion of value function has become quite general. With off-policy 
learning we allowed a value function to be conditional on an arbitrary target policy. Then in Section 12.8 
we generalized discounting to a termination function $\gamma : \mathcal{S} \to [0,1]$, so that a different discount rate could 
be applied at each time step in determining the return (12.24). This allowed us to express predictions 
about how much reward we will get over an arbitrary, state-dependent horizon. The next, and perhaps 
final, step is to generalize beyond rewards to permit predictions about arbitrary other signals. Rather 
than predicting the sum of future rewards, we might predict the sum of the future values of a sound or 
color sensation, or of an internal, highly processed signal such as another prediction. Whatever signal 
is added up in this way in a value-function-like prediction, we call it the cumulant of that prediction. 
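We formalize it in a cumulant signal $C_t \in \mathbb{R}$. Using this, a general value function, or GVF, is written 

\[ v_{\pi,\gamma,C}(s) \doteq \mathbb{E}\left[ \sum_{k=t}^{\infty} \left( \prod_{i=t+1}^{k} \gamma(S_i) \right) C_{k+1} \;\middle|\; S_t = s,\ A_{t:\infty} \sim \pi \right]. \qquad (17.1) \]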


As with conventional value functions (such as $v_\pi$ or $q_*$) this is an ideal function that we seek to 
approximate with a parameterized form, which we might continue to denote $\hat{v}(s, \mathbf{w})$, although of course 
there would have to be a different $\mathbf{w}$ for each prediction, that is, for each choice of $\pi$, $\gamma$, and $C_t$. Because 
a GVF has no necessary connection to reward, it is perhaps a misnomer to call it a value function. 
One could call it simply a prediction or, to make it more distinctive, a forecast (Ring, in preparation). 
Whatever it is called, it is in the form of a value function and thus can be learned in the usual ways 
using the methods developed in this book for learning approximate value functions. Along with the 
learned predictions, we might also learn policies to maximize the predictions in the usual GPI ways by 
greedification, or by actor-critic methods. In this way an agent could learn to predict and control great 
numbers of signals, not just long-term reward. 
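
To make the quantity being predicted concrete, the sketch below computes, for one finite sampled trajectory, the cumulant-weighted return whose expectation a GVF estimates: each cumulant is weighted by the product of the termination-function values at the states visited before it. The function name and the list-based layout are our assumptions.

```python
def gvf_return(cumulants, continuations):
    """Cumulant-weighted return from time t for one finite trajectory.

    cumulants[j]     is C_{t+j+1}, the signal observed after step t+j.
    continuations[j] is gamma(S_{t+j+1}), the termination-function value
                     at the state reached by that step.
    """
    total, weight = 0.0, 1.0
    for c, g in zip(cumulants, continuations):
        total += weight * c   # weight is the product of gammas so far
        weight *= g
    return total

# A state-dependent horizon: the prediction is cut off at the third state.
g_cut = gvf_return([1.0, 2.0, 4.0], [0.5, 0.5, 0.0])      # 1 + 0.5*2 + 0.25*4
# A constant termination function recovers ordinary discounting.
g_const = gvf_return([1.0, 1.0, 1.0], [0.9, 0.9, 0.9])    # 1 + 0.9 + 0.81
```

With reward as the cumulant and a constant termination function, this reduces to the conventional discounted return.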

Why might it be useful to predict and control signals other than long-term reward? These are 
auxiliary tasks in that they are extra, in addition to the main task of maximizing reward. One answer 
is that the ability to predict and control a diverse multitude of signals can constitute a powerful kind of 
environmental model. As we saw in Chapter 8, a good model can enable the agent to get reward more 
efficiently. It takes a couple of further concepts to develop this answer clearly, so we postpone it to the 
next section. First let’s consider two simpler ways in which a multitude of diverse predictions can be 
helpful to a reinforcement learning agent. 

One simple way in which auxiliary tasks can help on the main task is that they may require some of 
the same representations as are needed on the main task. Some of the auxiliary tasks may be easier, with 
less delay and a clearer connection between actions and outcomes. If good features can be found early 
on easy auxiliary tasks, then those features may significantly speed learning on the main task. There is 
no necessary reason why this has to be true, but in many cases it seems plausible. For example, if you 
learn to predict and control your sensors over short time scales, say seconds, then you might plausibly 
come up with part of the idea of objects, which would then greatly help with the prediction and control 
of long-term reward. 

We might imagine an artificial neural network in which the last layer is split into multiple parts, 
or heads, each working on a different task. One head might produce the approximate value function 
for the main task (with reward as its cumulant) whereas the others would produce solutions to various 
auxiliary tasks. All heads could propagate errors by stochastic gradient descent into the same body (the 
shared preceding part of the network), which would then try to form representations, in its next-to-last 
layer, to support all the heads. Researchers have experimented with auxiliary tasks such as predicting 
change in pixels, predicting the next-time-step’s reward, and predicting the distribution of the return. 
In many cases this approach has been shown to greatly accelerate learning on the main task (Jaderberg 
et al., 2017). Multiple predictions have similarly been repeatedly proposed as a way of directing the 
construction of state estimates (see Section 17.3). 

Another simple way in which the learning of auxiliary tasks can improve performance is best explained 
by analogy to the psychological phenomenon of classical conditioning (Section 14.2). One way of 
understanding classical conditioning is that evolution has built in a reflexive (non-learned) association 
to a particular action from the prediction of a particular signal. For example, humans and many other 
animals appear to have a built-in reflex to blink whenever their prediction of being poked in the eye 
exceeds some threshold. The prediction is learned, but the association from prediction to blinking is 
built in, and thus the animal is saved many pokes in its eye. Similarly, the association from fear to 
increased heart rate, or to freezing, can be built in. Agent designers can do something similar, connecting 
by design (without learning) predictions of specific events to predetermined actions. For example, 
a self-driving car that learns to predict whether going forward will produce a collision could be given a 
built-in reflex to stop, or to turn away, whenever the prediction is above some threshold. Or consider 
a vacuum-cleaning robot that learned to predict whether it might run out of battery power before 
returning to the charger, and reflexively headed back to the charger whenever the prediction became 
non-zero. The correct prediction would depend on the size of the house, the room the robot was in, and 
the age of the battery, all of which would be hard for the robot designer to know. It would be difficult 
for the designer to build in a reliable algorithm for deciding whether to head back to the charger in 
sensory terms, but it might be easy to do this in terms of the learned prediction. We foresee many 
possible ways like this in which learned predictions might combine usefully with built-in algorithms for 
controlling behavior. 
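
A built-in reflex of this kind amounts to a fixed rule layered on top of a learned prediction. The following sketch is purely illustrative: the function, the action names, and the threshold values are hypothetical.

```python
def choose_action(predicted_risk, learned_action, threshold, reflex_action):
    """Override the learned action when a learned prediction crosses a threshold.

    The prediction itself is learned; the mapping from prediction to the
    reflex action is fixed by the designer.
    """
    if predicted_risk > threshold:
        return reflex_action
    return learned_action

# Vacuum-robot example: head back as soon as the predicted chance of
# running out of battery before reaching the charger becomes non-zero.
action_low = choose_action(0.0, "keep_cleaning", threshold=0.0,
                           reflex_action="return_to_charger")
action_high = choose_action(0.3, "keep_cleaning", threshold=0.0,
                            reflex_action="return_to_charger")
```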

Finally, perhaps the most important role for auxiliary tasks is in moving beyond the assumption 
we have made throughout this book that the state representation is fixed and given to the agent. To 
explain this role, we first have to take a few steps back to appreciate the magnitude of this assumption 
and the implications of removing it. We do that in Section 17.3. 


17.2 Temporal Abstraction via Options 

An appealing aspect of the MDP formalism is that it can be applied usefully to tasks at many different 
time scales. One can use it to formalize the task of deciding which muscles to twitch to grasp an object, 
which airplane flight to take to arrive conveniently at a distant city, and which job to take to lead a 
satisfying life. These tasks differ greatly in their time scales, yet each can be usefully formulated as 
an MDP that can be solved by planning or learning processes as described in this book. All involve 
interaction with the world, sequential decision making, and a goal usefully conceived of as accumulating 
rewards over time, and so all can be formulated as MDPs. 

Although all these tasks can be formulated as MDPs, one might think that they cannot be formulated 
as a single MDP. They involve such different time scales, such different notions of choice and action! It 
would be no good, for example, to plan a flight across a continent at the level of muscle twitches. Yet 
for other tasks, such as grasping, throwing darts, or hitting a baseball, low-level muscle twitches may be just 
the right level. People do all these things seamlessly without appearing to switch between levels. Can 
the MDP framework be stretched to cover all the levels simultaneously? 

Perhaps it can. One popular idea is to formalize an MDP at a detailed level, with a small time step, 
yet enable planning at higher levels using extended courses of action that correspond to many base-level 
time steps. To do this we need a notion of course of action that extends over many time steps and 
includes a notion of termination. A general way to formulate these two ideas is as a policy, $\pi$, and a 
state-dependent termination function, $\gamma$, as in GVFs. We define a pair of these as a generalized notion of 
action termed an option. To execute an option $\omega = \langle \pi_\omega, \gamma_\omega \rangle$ at time $t$ is to obtain the action to take, $A_t$, 
from $\pi_\omega(\cdot|S_t)$, then terminate at time $t+1$ with probability $1 - \gamma_\omega(S_{t+1})$. If the option does not terminate, 
then $A_{t+1}$ is selected from $\pi_\omega(\cdot|S_{t+1})$, the option terminates at $t+2$ with probability $1 - \gamma_\omega(S_{t+2})$, and 
so on until eventual termination. Such options are a strict generalization of low-level actions, which 
correspond to options $\langle \pi_\omega, \gamma_\omega \rangle$ in which the policy always picks the same action ($\pi_\omega(s) = a$ for all $s \in \mathcal{S}$) 
and in which the termination condition always terminates ($\gamma_\omega(s) = 0$ for all $s \in \mathcal{S}^+$). Options effectively 
extend the action space. The agent can either select a primitive action, terminating after one time step, 
or select an extended option that might execute for many time steps before terminating. 
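
Executing an option can be sketched as a simple loop. Here the termination function is read as a continuation probability, so that a value of 0 means the option always terminates, consistent with the treatment of primitive actions above. The environment interface (`step`), the corridor example, and all names are hypothetical.

```python
import random

def run_option(state, policy, continuation, step, rng=random, max_steps=10_000):
    """Execute an option from `state` until it terminates.

    policy(state, rng)  -> action           (the option's policy)
    continuation(state) -> value in [0, 1]; the option stops in `state`
                           with probability 1 - continuation(state)
    step(state, action) -> (reward, next_state)   (assumed environment)
    Returns (list of rewards collected, final state).
    """
    rewards = []
    for _ in range(max_steps):
        action = policy(state, rng)
        reward, state = step(state, action)
        rewards.append(reward)
        if rng.random() >= continuation(state):
            break
    return rewards, state

# Hypothetical corridor: states are integers, each move costs 1.
def corridor_step(state, action):
    return -1.0, state + action

# An extended option: keep moving right until state 5 is reached.
rewards, final_state = run_option(
    0, lambda s, rng: +1, lambda s: 0.0 if s == 5 else 1.0, corridor_step)

# A primitive action as an option: same policy, but always terminate.
prim_rewards, prim_state = run_option(
    0, lambda s, rng: +1, lambda s: 0.0, corridor_step)
```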

Options are designed so that they are interchangeable with (primitive) actions. For example, the 
notion of an action-value function $q_\pi$ naturally generalizes to an option-value function that takes a state 
and option as input and returns the expected return starting from that state, executing that option 
to termination, and thereafter following the policy, $\pi$. We can also generalize the notion of policy 
to a hierarchical policy that selects from options rather than actions, where options, when selected, 
execute until termination. With these ideas, many of the algorithms in this book can be generalized to 
learn approximate option-value functions and hierarchical policies. In the simplest case, the learning 
process ‘jumps’ from option initiation to option termination, with an update only occurring when an 
option terminates. More subtly, updates can be made on each time step, using intra-option learning 
algorithms, which in general require off-policy learning. 
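
The simplest case, in which learning jumps from option initiation to option termination, can be sketched as a Q-learning-style update applied once per completed option (a form often called SMDP Q-learning). This particular update is our illustration rather than an algorithm given in the text; the dictionary-based table and parameter values are assumptions.

```python
def smdp_q_update(q, state, option, rewards, next_state, next_options,
                  alpha=0.1, gamma=0.9):
    """Update q[(state, option)] once the option has terminated.

    rewards      : rewards received on each step while the option ran
    next_state   : state in which the option terminated
    next_options : options available in next_state (empty if terminal)
    """
    k = len(rewards)
    # Discounted reward accumulated during the option's k steps.
    discounted = sum(gamma ** i * r for i, r in enumerate(rewards))
    bootstrap = max((q.get((next_state, o), 0.0) for o in next_options),
                    default=0.0)
    target = discounted + gamma ** k * bootstrap
    old = q.get((state, option), 0.0)
    q[(state, option)] = old + alpha * (target - old)
    return q[(state, option)]

q_opt = {(5, "stay"): 10.0}
new_value = smdp_q_update(q_opt, 0, "walk", [-1.0, -1.0], 5, ["stay"])
```

The bootstrap term is discounted by $\gamma^k$ rather than $\gamma$ because $k$ base-level time steps elapse while the option executes.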

Perhaps the most important generalization made possible by option ideas is that of the environmental 
model as developed in Chapters 3, 4, and 8. The conventional model of an action is the state-transition 
probabilities and the expected immediate reward for taking the action in each state. How do conventional 
action models generalize to option models? For options, the appropriate model is again of two 
parts, one corresponding to the state transition resulting from executing the option and one corresponding 
to the expected cumulative reward along the way. The reward part of an option model, analogous 
to the expected reward for state-action pairs (3.5), is 

$$r(s, \omega) \doteq \mathbb{E}\left[R_1 + \gamma R_2 + \gamma^2 R_3 + \cdots + \gamma^{\tau-1} R_\tau \mid S_0 = s, A_{0:\tau-1} \sim \pi_\omega, \tau \sim \gamma_\omega\right], \tag{17.2}$$

for all options $\omega$ and all states $s \in \mathcal{S}$, where $\tau$ is the random time step at which the option terminates
according to $\gamma_\omega$. Note the role of the overall discounting parameter $\gamma$ in this equation: discounting is
according to $\gamma$, but termination of the option is according to $\gamma_\omega$. The state-transition part of an option
model is a little more subtle. This part of the model characterizes the probability of each possible
resulting state (as in (3.4)), but now this state may result after various numbers of time steps, each of
which must be discounted differently. The model for option $\omega$ specifies, for each state $s$ that $\omega$ might
start executing in, and for each state $s'$ that $\omega$ might terminate in,

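$$p(s'|s, \omega) \doteq \sum_{k=1}^{\infty} \gamma^k \Pr\{S_k = s', \tau = k \mid S_0 = s, A_{0:k-1} \sim \pi_\omega, \tau \sim \gamma_\omega\}. \tag{17.3}$$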

Note that this $p(s'|s, \omega)$ is no longer a transition probability and no longer sums to one over all values
of $s'$. (Nevertheless, we continue to use the ‘|’ notation in $p$.)
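
When the primitive dynamics are known, both parts of an option model can be computed by repeatedly applying one-step recursions that follow from the definitions: after each step the option either stops (contributing to the transition part) or continues (and everything after is discounted once more). The sketch below assumes a small tabular MDP and a deterministic option policy; the data layout, the function name `option_model`, and the 3-state chain are all hypothetical.

```python
def option_model(P, R, policy, cont, gamma, sweeps=100):
    """Approximate r(s, w) and p(s'|s, w) for one option by repeated sweeps.

    P[s][a]   : list of (probability, next_state) pairs for primitive action a
    R[s][a]   : expected immediate reward for taking a in s
    policy[s] : the option's (deterministic) action in s
    cont[s]   : gamma_omega(s), the probability of continuing on reaching s
    """
    n = len(P)
    r = [0.0] * n
    p = [[0.0] * n for _ in range(n)]
    for _ in range(sweeps):
        new_r, new_p = [], []
        for s in range(n):
            a = policy[s]
            r_s = R[s][a]
            p_s = [0.0] * n
            for prob, s1 in P[s][a]:
                # With probability 1 - cont[s1] the option stops in s1 ...
                p_s[s1] += prob * gamma * (1.0 - cont[s1])
                # ... otherwise it continues, and the rest is discounted once more.
                r_s += prob * gamma * cont[s1] * r[s1]
                for s2 in range(n):
                    p_s[s2] += prob * gamma * cont[s1] * p[s1][s2]
            new_r.append(r_s)
            new_p.append(p_s)
        r, p = new_r, new_p
    return r, p


# Hypothetical 3-state chain 0 -> 1 -> 2 with a single action (index 0).
P = [[[(1.0, 1)]], [[(1.0, 2)]], [[(1.0, 2)]]]
R = [[0.0], [1.0], [0.0]]          # reward 1 for the step from state 1 into state 2
go_to_2 = dict(policy=[0, 0, 0], cont=[1.0, 1.0, 0.0])   # stop only in state 2
r, p = option_model(P, R, gamma=0.9, **go_to_2)
```

In this chain the option started in state 0 terminates in state 2 after exactly two steps, so the entry `p[0][2]` comes out as the discount rate squared rather than 1, which illustrates why the rows of the model no longer sum to one.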

The above definition of the transition part of an option model allows us to formulate Bellman equations
and dynamic programming algorithms that apply to all options, including primitive actions as a
special case. For example, the general Bellman equation for the state values of a hierarchical policy $\pi$
is


$$v_\pi(s) = \sum_{\omega \in \Omega(s)} \pi(\omega|s) \left[ r(s, \omega) + \sum_{s'} p(s'|s, \omega)\, v_\pi(s') \right], \tag{17.4}$$


where $\Omega(s)$ denotes the set of options available in state $s$. If $\Omega(s)$ includes only the primitive actions, then
this equation reduces to a version of the usual Bellman equation (3.14), except of course $\gamma$ is included
in the new $p$ (17.3) and thus does not appear. Similarly, the corresponding planning algorithms also
have no $\gamma$. For example, the value iteration algorithm with options, analogous to (4.10), is


$$v_{k+1}(s) \doteq \max_{\omega \in \Omega(s)} \left[ r(s, \omega) + \sum_{s'} p(s'|s, \omega)\, v_k(s') \right], \quad \text{for all } s \in \mathcal{S}. \tag{17.5}$$


If $\Omega(s)$ includes all the primitive actions available in each $s$, then this algorithm converges to the
conventional $v_*$, from which the optimal policy can be computed. However, it is particularly useful to
plan with options when only a subset of the possible options are considered (in $\Omega(s)$) in each state.
Value iteration will then converge to the best hierarchical policy limited to the restricted set of options. 
Although this policy may be sub-optimal, convergence can be much faster because fewer options are 
considered and because each option can jump over many time steps. 
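
A minimal sketch of value iteration over option models may help make the absence of the discount rate concrete. It assumes the models are given as plain lists; the function name, the data layout, and the three-state chain are hypothetical.

```python
def value_iteration_with_options(models, sweeps=100):
    """Value iteration (17.5) over option models.

    models[s] is a list with one (r, p_row) pair per option available in s,
    where r = r(s, w) and p_row[s'] = p(s'|s, w). No discount factor appears
    because it is already folded into p_row.
    """
    n = len(models)
    v = [0.0] * n
    for _ in range(sweeps):
        v = [
            max(r + sum(p_row[s1] * v[s1] for s1 in range(n)) for r, p_row in models[s])
            for s in range(n)
        ]
    return v


# Hypothetical chain 0 -> 1 -> 2 (terminal), gamma = 0.9 folded into the models.
stay_0 = (0.0, [0.9, 0.0, 0.0])      # primitive: stay in state 0
right_0 = (0.0, [0.0, 0.9, 0.0])     # primitive: move from 0 to 1
right_1 = (1.0, [0.0, 0.0, 0.9])     # primitive: move from 1 to 2, reward 1
run_0 = (0.9, [0.0, 0.0, 0.81])      # extended: run from 0 all the way to 2
terminal = (0.0, [0.0, 0.0, 0.0])    # nothing more happens after termination

v_all = value_iteration_with_options([[stay_0, right_0, run_0], [right_1], [terminal]])
v_restricted = value_iteration_with_options([[stay_0, run_0], [right_1], [terminal]])
```

In this toy case the restricted option set happens to lose nothing, because the extended option follows the optimal path; in general the restricted solution may be worse than the conventional optimum.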

To plan with options, one must either be given the option models, or learn them. One natural way 
to learn an option model is to formulate it as a collection of GVFs (as defined in the preceding section) 
and then learn the GVFs using the methods presented in this book. It is not difficult to see how this 
could be done for the reward part of the option model. One merely chooses one GVF’s cumulant to be
the reward ($C_t = R_t$), its policy to be the option’s policy ($\pi = \pi_\omega$), and its termination function
to be the discount rate times the option’s termination function ($\gamma(s) = \gamma \cdot \gamma_\omega(s)$). The true GVF then
equals the reward part of the option model, $v_{\pi,\gamma,C}(s) = r(s, \omega)$, and the learning methods described in
this book can be used to approximate it. The state-transition part of the option model is only a little
more complicated. One needs to allocate one GVF for each state that the option might terminate in. 
We don’t want these GVFs to accumulate anything except when the option terminates, and then only 
when the termination is in the appropriate state. This can be achieved by choosing the cumulant of the
GVF that predicts transition to state $s'$ to be $C_t = (1 - \gamma_\omega(S_t)) \cdot \mathbb{1}_{S_t = s'}$. The GVF’s policy and termination
functions are chosen the same as for the reward part of the option model. The true GVF then equals the
$s'$ portion of the option’s state-transition model, $v_{\pi,\gamma,C}(s) = p(s'|s, \omega)$, and again this book’s methods
could be employed to learn it. Although each of these steps is seemingly natural, putting them all
together (including function approximation and other essential components) is quite challenging and 
beyond the current state of the art. 

Exercise 17.1 This section has presented options for the discounted case, but discounting is arguably
inappropriate for control when using function approximation (Section 10.4). What is the natural Bellman
equation for a hierarchical policy, analogous to (17.4), but for the average reward setting (Section 10.3)?
What are the two parts of the option model, analogous to (17.2) and (17.3), for the average
reward setting?


17.3 Observations and State

Throughout this book we have written the learned approximate value functions (and the policies in 
Chapter 13) as functions of the environment’s state. This is a significant limitation of the methods 
presented in Part I, in which the learned value function was implemented as a table such that any 
value function could be exactly approximated; that case is tantamount to assuming that the state of 
the environment is completely observed by the agent. But in many cases of interest, and certainly in 
the lives of all natural intelligences, the sensory input gives only partial information about the state 
of the world. Some objects may be occluded by others, or behind the agent, or miles away. In these
cases, potentially important aspects of the environment’s state are not directly observable, and it is a
strong, unrealistic, and limiting assumption that the learned value function is implemented
as a table over the environment’s state space.

On the other hand, the framework of parametric function approximation that we developed in Part 
II is far less restrictive and, arguably, is no limitation at all. In Part II we retained the assumption 
that the learned value functions (and policies) are functions of the environment’s state, but allowed 
these functions to be arbitrarily restricted by the parameterization. It is somewhat surprising and not 
widely recognized, but function approximation includes important aspects of partial observability. For 
example, if there is some state variable that is not observable, then the parameterization can be chosen 
such that the approximate value does not depend on that state variable. The effect is just as if that state
variable were not observable. Because of this, all the results obtained for the parameterized case apply
to partial observability without change. In this sense, the case of parameterized function approximation 
includes the case of partial observability. 

Nevertheless, there are many issues that cannot be investigated without a more explicit treatment of 
partial observability. Although we cannot give them a full treatment here, we can outline the changes 
that would be needed to do so. There are four steps. 

First, we would change the problem. The environment would emit not its states, but only observations:
signals that depend on its state but, like a robot’s sensors, provide only partial information
about it. For convenience, without loss of generality, we assume that the reward is a direct, known
function of the observation (perhaps the observation is a vector, and the reward is one of its components).
The environmental interaction would then have no explicit states or rewards, but would simply be an
alternating sequence of actions $A_t \in \mathcal{A}$ and observations $O_t \in \mathcal{O}$:

$$A_0, O_1, A_1, O_2, A_2, O_3, A_3, O_4, \ldots,$$

going on forever (cf. Equation 3.1) or forming episodes each ending with a special terminal observation. 

Second, we can recover the idea of state as used in this book from the sequence of observations and 
actions. Let us use the word history, and the notation $H_t$, for an initial portion of the trajectory up to
an observation: $H_t \doteq A_0, O_1, \ldots, A_{t-1}, O_t$. The history represents the most that we can know about
the past without looking outside of the data stream (because the history is the whole past data stream). 
Of course, the history grows with t and can become large and unwieldy. The idea of state is that of 
some compact summary of the history that is as useful as the actual history for predicting the future. 
Let us be clear about exactly what this means. To be a summary of the history, the state must be a
function of history, $S_t = f(H_t)$, and to be as useful for predicting the future as the whole history, it
must have what is known as the Markov property. Formally, this is a property of the function $f$. A function $f$ has the
Markov property if and only if any two histories $h$ and $h'$ that are mapped by $f$ to the same state
($f(h) = f(h')$) also have the same probabilities for their next observation,

$$f(h) = f(h') \;\Rightarrow\; \Pr\{O_{t+1} = o \mid H_t = h, A_t = a\} = \Pr\{O_{t+1} = o \mid H_t = h', A_t = a\}, \tag{17.6}$$

for all $o \in \mathcal{O}$ and $a \in \mathcal{A}$. If $f$ is Markov, then $S_t = f(H_t)$ is a state as we have used the term in this
book. Let us henceforth call it a Markov state to distinguish it from states that are summaries of the 
history but fall short of the Markov property (which we will consider shortly). 





A Markov state is a good basis for predicting the next observation (17.6) but, more importantly, it is 
also a good basis for predicting or controlling anything. For example, let a test be any specific sequence
of alternating actions and observations that might occur in the future. A three-step test, for instance,
is denoted $\tau = a_1 o_1 a_2 o_2 a_3 o_3$. The probability of this test given a specific history $h$ is defined as

$$p(\tau|h) \doteq \Pr\{O_{t+1} = o_1, O_{t+2} = o_2, O_{t+3} = o_3 \mid H_t = h, A_t = a_1, A_{t+1} = a_2, A_{t+2} = a_3\}. \tag{17.7}$$

If $f$ is Markov and $h$ and $h'$ are any two histories that map to the same state under $f$, then for any test
$\tau$ of any length, its probabilities given the two histories must also be the same:

$$f(h) = f(h') \;\Rightarrow\; p(\tau|h) = p(\tau|h'). \tag{17.8}$$

In other words, a Markov state summarizes all the information in the history necessary for determining 
any test’s probability. In fact, it summarizes all that is necessary for making any prediction, including 
any GVF, and for behaving optimally (if $f$ is Markov, then there is always a deterministic function $\pi$
such that choosing $A_t \doteq \pi(f(H_t))$ is optimal).

The third step in extending reinforcement learning to partial observability is to deal with certain 
computational considerations. In particular, we want the state to be a compact summary of the history.
For example, the identity function completely satisfies the conditions for a Markov $f$, but would
nevertheless be of little use because the corresponding state $S_t = H_t$ would grow with time and become
unwieldy, as mentioned earlier, but more fundamentally because it would never recur; the agent would
never encounter the same state twice (in a continuing task) and thus could never benefit from a tabular
learning method. We want our states to be compact as well as Markov. There is a similar issue
regarding how state is obtained and updated. We don’t really want a function $f$ that takes whole histories.
Instead, for computational reasons we prefer to obtain the same effect as $f$ with an incremental,
recursive update that computes $S_{t+1}$ from $S_t$, incorporating the next increment of data, $A_t$ and $O_{t+1}$:

$$S_{t+1} \doteq u(S_t, A_t, O_{t+1}), \quad \text{for all } t \ge 0, \tag{17.9}$$

with the first state $S_0$ given. The function $u$ is called the state-update function. For example, if $f$ were
the identity ($S_t = H_t$), then $u$ would merely extend $S_t$ by appending $A_t$ and $O_{t+1}$ to it. Given $f$, it is
always possible to construct a corresponding $u$, but it may not be computationally convenient and, as
in the identity example, it may not produce a compact state. The state-update function is a central
part of any agent architecture that handles partial observability. It must be efficiently computable, as
no actions or predictions can be made until the state is available.

An example of obtaining Markov states through a state-update function is provided by the popular 
Bayesian approach known as Partially Observable MDPs, or POMDPs. In this approach the environment
is assumed to have a well-defined latent state $X_t$ that underlies and produces the environment’s
observations, but is never available to the agent (and is not to be confused with the state $S_t$ used by the
agent to make predictions and decisions). The natural Markov state $S_t$ for a POMDP is the distribution
over the latent states given the history, called the belief state. For concreteness, assume the usual case
in which there are a finite number of hidden states, $X_t \in \{1, 2, \ldots, d\}$. Then the belief state is the
vector $S_t \doteq \mathbf{s}_t \in \mathbb{R}^d$ with components

$$\mathbf{s}_t[i] \doteq \Pr\{X_t = i \mid H_t\}, \quad \text{for all possible latent states } i \in \{1, 2, \ldots, d\}. \tag{17.10}$$

The belief state remains the same size (same number of components) however t grows. It can also be 
incrementally updated by Bayes rule, assuming one has complete knowledge of the internal workings of 
the environment. Specifically, the ith component of the belief-state update function is 

$$u(\mathbf{s}, a, o)[i] \doteq \frac{\sum_{x=1}^{d} \mathbf{s}[x]\, p(i, o|x, a)}{\sum_{x=1}^{d} \sum_{x'=1}^{d} \mathbf{s}[x]\, p(x', o|x, a)}, \tag{17.11}$$

for all $a \in \mathcal{A}$, $o \in \mathcal{O}$, and belief states $\mathbf{s} \in \mathbb{R}^d$ with components $\mathbf{s}[x]$, where the four-argument $p$ function
here is not the usual one for MDPs (as in Chapter 3), but the analogous one for POMDPs, in terms
of the latent state: $p(x', o|x, a) \doteq \Pr\{X_t = x', O_t = o \mid X_{t-1} = x, A_{t-1} = a\}$. This approach is popular
in theoretical work and has many significant applications, but its assumptions and computational
complexity scale poorly and we do not recommend it as an approach to artificial intelligence.
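
The belief-state update is just Bayes rule with a normalizing sum, as the following sketch shows. It assumes the four-argument dynamics function is available as an ordinary Python callable and indexes latent states from 0 rather than 1; the function names and the two-state example are hypothetical.

```python
def belief_update(s, a, o, p):
    """One step of the belief-state update u(s, a, o).

    s is a list of d probabilities over latent states (indexed from 0 here);
    p(x_next, o, x, a) is the POMDP's four-argument dynamics function
    Pr{X_t = x_next, O_t = o | X_{t-1} = x, A_{t-1} = a}.
    """
    d = len(s)
    unnormalized = [sum(s[x] * p(i, o, x, a) for x in range(d)) for i in range(d)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [u / total for u in unnormalized]


def p(x_next, o, x, a):
    """Hypothetical POMDP: the latent state never changes, and the
    observation matches it with probability 0.8."""
    if x_next != x:
        return 0.0
    return 0.8 if o == x_next else 0.2


belief = belief_update([0.5, 0.5], a=0, o=0, p=p)
belief2 = belief_update(belief, a=0, o=0, p=p)
```

Each call costs on the order of d squared evaluations of `p`, which hints at why the approach becomes expensive as the number of latent states grows.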

Another example of Markov states is provided by Predictive State Representations, or PSRs. PSRs 
address a weakness of the POMDP approach that the semantics of its agent state S t are grounded in 
the environment state, X t , which is never observed and thus is difficult to learn about. In PSRs and 
related approaches, the semantics of the agent state is instead grounded in predictions about future 
observations and actions, which are readily observable. In PSRs, a Markov state is defined as a $d$-vector
of the probabilities of $d$ specially chosen “core” tests as defined above (17.7). The vector is then
updated by a state-update function u that is analogous to Bayes rule, but with a semantics grounded 
in observable data, which arguably makes it easier to learn. This approach has been extended in 
many ways, including end-tests, compositional tests, powerful “spectral” methods, and closed-loop and 
temporally abstract tests learned by TD methods. Some of the best theoretical developments are for 
systems known as Observable Operator Models (OOMs) and Sequential Systems (Thon, 2017). 

The fourth and final step in our brief outline of how to handle partial observability in reinforcement 
learning is to re-introduce approximation. As discussed in the introduction to Part II, to approach 
artificial intelligence ambitiously one must embrace approximation. This is just as true for states as it 
is for value functions. We must accept and work with an approximate notion of state. The approximate 
state will play the same role in our algorithms as before, so we continue to use the notation $S_t$ for the
state used by the agent, even though it may not be Markov. 

Perhaps the simplest example of an approximate state is just the latest observation, $S_t \doteq O_t$. Of course
this approach cannot handle any hidden state information. Better is to use the last $k$ observations and
actions, $S_t \doteq O_t, A_{t-1}, O_{t-1}, \ldots, O_{t-k}$, for some $k \ge 1$, which can be achieved by a state-update
function that just shifts the new data in and the oldest data out. This kth-order history approach is 
still very simple, but can greatly increase the agent’s capabilities compared to trying to use the single 
immediate observation directly as the state. 
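
A kth-order history state-update function is short enough to write out in full. This is a minimal sketch assuming observations and actions are hashable values and that the initial state is an empty tuple; the function name and the string labels are hypothetical.

```python
def kth_order_update(state, action, observation, k):
    """u(S_t, A_t, O_{t+1}): shift the newest data in and the oldest out.

    The state is a tuple (O_t, A_{t-1}, O_{t-1}, ..., O_{t-k}); tuples are
    hashable, so the result can index a table or a dict.
    """
    return ((observation, action) + tuple(state))[: 2 * k + 1]


state = ()  # assumed S_0: nothing observed yet
for action, observation in [("a0", "o1"), ("a1", "o2"), ("a2", "o3")]:
    state = kth_order_update(state, action, observation, k=1)
```

Until enough data has arrived, the state is simply shorter than its full length of 2k + 1 entries; after that its size stays fixed no matter how long the agent runs.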

What happens when the Markov property (17.6) is only approximately satisfied? Prediction performance
can degrade dramatically. Longer-term tests, GVFs, and state-update functions may all
approximate poorly with an approximate state even if the one-step predictions (17.6) defining the 
Markov property are well approximated with it. There are essentially no useful theoretical guarantees 
at present. 

Nevertheless, there are still reasons to think that the general idea outlined in this section applies 
to the approximate case. The general idea is that a state good for some predictions is also good for 
others (in particular, that a Markov state, sufficient for one-step predictions, is also sufficient for all 
others). If we step back from that specific result for the Markov case, the general idea is similar to 
what we discussed in Section 17.1 with multi-headed learning and auxiliary tasks. We discussed how 
representations that were good for the auxiliary tasks were often also good for the main task. Taken 
together, these suggest an approach to both partial observability and representation learning in which 
multiple predictions are pursued and used to direct the construction of state features. The guarantee
provided by the perfect-but-impractical Markov property is replaced by the heuristic that what’s good 
for some predictions may be good for others. This approach scales well with computational resources. 
With a large machine one could experiment with large numbers of predictions, perhaps favoring those 
that are most similar to the ones of ultimate interest, that are easiest to learn reliably, or by other 
criteria. Key here is to move beyond selecting the predictions manually. The agent should do it. This 
would require a general language for predictions, so that the agent can systematically explore a large 
space of possible predictions, sifting through them for the ones that are most useful.


17.4 Designing Reward Signals

A major advantage of reinforcement learning over supervised learning is that reinforcement learning does 
not rely on detailed instructional information: designing reward signals does not depend on knowing 
what the correct actions should be. But not all reward signals are created equal. The success of a 
reinforcement learning application often strongly depends on how well the reward signal frames the 
problem and how well it assesses progress in solving it. Future applications can benefit from a better 
understanding of how reward signals affect learning and from improved methods for designing them. 

The usual way to use reinforcement learning to solve a problem is to reward the agent according to 
its success in solving the problem. This is relatively easy for many problems, but some problems have 
goals that are difficult to translate into reward signals. This is especially true when the problem is to get 
an agent to skillfully perform a complex task. Even when there is a simple goal that is easy to identify, 
the problem of sparse rewards often arises. Delivering non-zero reward frequently enough to allow the 
agent to achieve the goal once, let alone to learn to achieve it efficiently from multiple initial conditions, 
can be a daunting challenge. Further, reinforcement learning agents can discover unexpected ways to 
make their environments deliver reward, some of which might be undesirable, or even dangerous. For 
these reasons, designing reward signals is a critical part of any reinforcement learning application. 

In practice, designing reward signals is often left to an informal trial-and-error search for a signal 
that produces acceptable results. If the agent fails to learn, learns too slowly, or learns the wrong 
thing, the designer tweaks the reward signal and tries again. To do this, the designer judges the agent’s 
performance by criteria that he or she is attempting to translate into reward signals so that the
agent’s goal matches his or her own. Some more sophisticated ways to find good reward signals have 
been proposed, but the subject has interesting and relatively unexplored dimensions. 

It is tempting to address the sparse reward problem by rewarding the agent for achieving subgoals that 
the designer thinks are important way stations to the overall goal. But augmenting the reward signal 
with well-intentioned supplemental rewards may lead the agent to behave very differently from what 
is intended; the agent may end up not achieving the overall goal at all. A better way to provide such 
guidance is to leave the reward signal alone and instead augment the value-function approximation with 
an initial guess of what it should ultimately be. For example, suppose one wants to offer $v_0 : \mathcal{S} \to \mathbb{R}$ as
an initial guess at the true optimal value function $v_*$, and that one is using linear function approximation
with features $\mathbf{x} : \mathcal{S} \to \mathbb{R}^d$. Then one would define the approximate value function as

$$\hat{v}(s, \mathbf{w}) \doteq \mathbf{w}^\top \mathbf{x}(s) + v_0(s), \tag{17.12}$$

and update the weights $\mathbf{w}$ as usual. If the initial weight vector is $\mathbf{0}$, then the initial value function will
be $v_0$, but the asymptotic solution quality will be determined by the feature vectors as usual. This
initialization works for arbitrary nonlinear approximators and arbitrary forms of $v_0$. Wiewiora (2003)
showed that this initialization is equivalent to the more complex “potential-based shaping” technique 
for changing rewards described by Ng, Harada, and Russell (1999). 
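
The offset form of the value function is easy to try out. The sketch below pairs it with a semi-gradient TD(0) update, which is our choice for illustration; the one-hot features, the particular initial guess, and the function names are hypothetical.

```python
def v_hat(s, w, x, v0):
    """Approximate value: linear in the features, offset by the initial guess v0."""
    return sum(wi * xi for wi, xi in zip(w, x(s))) + v0(s)


def td0_update(w, s, reward, s_next, x, v0, alpha, gamma):
    """Semi-gradient TD(0) step; the gradient w.r.t. w is x(s), since v0 has no weights."""
    delta = reward + gamma * v_hat(s_next, w, x, v0) - v_hat(s, w, x, v0)
    return [wi + alpha * delta * xi for wi, xi in zip(w, x(s))]


def x(s):
    """Hypothetical one-hot features over three states."""
    return [1.0 if i == s else 0.0 for i in range(3)]


guess = [0.5, 1.0, 0.0]
v0 = lambda s: guess[s]

w = [0.0, 0.0, 0.0]
before = v_hat(0, w, x, v0)
w = td0_update(w, s=0, reward=0.0, s_next=1, x=x, v0=v0, alpha=0.5, gamma=1.0)
after = v_hat(0, w, x, v0)
```

With zero weights the estimate equals the initial guess, and learning then moves only the weights, leaving the reward signal untouched.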

What if one has no idea what the rewards should be but there is another agent, perhaps a person, who 
is already expert at the task and whose behavior can be observed? In this case one can use a variety of 
methods known variously as “imitation learning,” “learning from demonstration,” and “apprenticeship 
learning.” The idea here is to benefit from the expert agent but leave open the possibility of eventually 
performing even better. Learning from an expert’s behavior can be done either by learning directly 
by supervised learning or by extracting a reward signal using what is known as “inverse reinforcement 
learning” and then using a reinforcement learning algorithm with that reward function to learn a 
policy. The task of inverse reinforcement learning as explored by Ng and Russell (2000) is to recover 
the expert’s reward signal (within a scalar constant) from the expert’s behavior alone. This cannot be 
done exactly because a policy can be optimal with respect to many different reward signals (always 
including any reward signal that gives the same reward for all states and actions), but it is possible 
to find plausible reward-signal candidates. However, some strong assumptions are required, including
knowledge of the feature vectors in which the reward signal is linear and complete knowledge of the
environment’s dynamics. The method also requires completely solving the problem (e.g., by dynamic 
programming methods) multiple times. These difficulties notwithstanding, Abbeel and Ng (2004) argue
that the inverse reinforcement learning approach can sometimes be more effective than supervised learning for
benefiting from the behavior of an expert. 

Another approach to finding a good reward signal is based on automating the trial-and-error search 
for such a signal that we mentioned above. From an engineering perspective, the reward signal is 
a parameter of the learning algorithm. As is true for other algorithm parameters, the search for 
a good reward signal can be automated by defining a space of feasible candidates and applying an 
optimization algorithm. The optimization algorithm evaluates each candidate reward signal by running 
the reinforcement learning system with that signal for some number of steps, and then scoring the 
overall result by a “high-level” objective function intended to encode the designer’s true goal, ignoring 
the limitations of the agent. In some cases, reward signals can be improved via online gradient ascent, 
again as evaluated by a high-level objective function (Sorg, Lewis, and Singh, 2010). Relating this to 
the natural world, the algorithm for optimizing the high-level objective function is analogous to
evolution, where the high-level objective function is like an animal’s evolutionary fitness determined by 
the number of its offspring that survive to reproductive age. 
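
The two-level structure of this search can be illustrated with a toy example. This is only a sketch under strong simplifying assumptions, not the method of the cited studies: the "agent" is a depth-limited planner standing in for a learning system, the candidate space is three hand-picked values of a hypothetical bonus coefficient `c`, and the outer optimizer is exhaustive search.

```python
N = 6                 # goal state of a hypothetical corridor with states 0..N
ACTIONS = (-1, +1)    # on ties, max() keeps the first action (left)


def move(s, a):
    return min(N, max(0, s + a))


def agent_reward(s, s_next, c):
    """Candidate reward signal: 1 at the goal plus a progress bonus scaled by c."""
    return (1.0 if s_next == N else 0.0) + c * (s_next - s)


def lookahead(s, depth, c):
    """Best return the agent can foresee within `depth` steps of planning."""
    if depth == 0 or s == N:
        return 0.0
    return max(agent_reward(s, move(s, a), c) + lookahead(move(s, a), depth - 1, c)
               for a in ACTIONS)


def act(s, depth, c):
    return max(ACTIONS, key=lambda a: agent_reward(s, move(s, a), c)
               + lookahead(move(s, a), depth - 1, c))


def objective(c, depth=2, horizon=20):
    """High-level objective: 1 if the agent reaches the goal in time, else 0."""
    s = 0
    for _ in range(horizon):
        s = move(s, act(s, depth, c))
        if s == N:
            return 1.0
    return 0.0


def search(candidates):
    """Outer loop: score every candidate reward signal, keep the best."""
    return max(candidates, key=objective)


best_c = search([0.0, 0.1, 0.5])
```

With `c = 0` the agent's reward signal coincides with the designer's goal, yet the short-sighted planner never sees the goal from the start state and stays put; a nonzero bonus gives a reward signal that differs from the designer's objective but scores better on it.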

Experiments with this bilevel optimization approach (Singh, Lewis, and Barto, 2009) confirmed that 
intuition alone is not always adequate to devise good reward signals. The performance of a reinforcement 
learning agent as evaluated by the high-level objective function can be very sensitive to details of the 
agent’s reward signal in subtle ways determined by the agent’s limitations and the environments in 
which it acts and learns. These experiments also demonstrated that an agent’s goal should not always 
be the same as the goal of the agent’s designer. 

At first this seems counterintuitive, but it may be impossible for the agent to achieve the designer’s 
goal no matter what its reward signal is because the agent has to learn under various kinds of constraints, 
such as having limited computational power, limited access to information about its environment, or 
limited time to learn. When there are constraints like these, learning to achieve a goal that is different 
from the designer’s goal can sometimes end up getting closer to the designer’s goal than if that goal were 
pursued directly (Sorg, Singh, and Lewis, 2010; Sorg, 2011). There are also abundant examples of this 
in the natural world. Since we cannot directly detect the nutritional value of most foods, evolution (the
designer of our reward signal) produced a reward signal that makes us seek certain tastes. Though
certainly not infallible (indeed, possibly detrimental in environments that differ in certain ways from 
the ancestral environments), this compensates for many of our limitations: our limited sensory abilities, 
the limited time over which we can learn, and the risks involved in finding a diet through personal 
reinforcement learning. Similarly, since an animal cannot observe its own evolutionary fitness, that 
evaluation function does not work as a reward signal for learning (although predictors of evolutionary 
fitness certainly can be observed and figure prominently in animals’ reward signals). 

Another dimension to devising reward signals is whether the agent is to learn to solve a specific 
problem, or if instead, it is to learn skills that can be useful across many different problems that the 
agent is likely to face in the future. Pursuing the latter goal has led to the idea of implementing in 
reinforcement learning something like what psychologists call “intrinsic motivation.” Where “extrinsic 
motivation” means doing something because of some specific rewarding outcome, “intrinsic motivation” 
refers to doing something “for its own sake.” Intrinsic motivation leads animals to engage in exploration, 
play, and other behavior driven by curiosity in the absence of problem-specific rewards. The true value 
of what is learned via intrinsically-motivated behavior (which for an animal would be the value of the 
evolutionary advantage it confers) emerges over long-term experience with many different specific tasks. 

Giving an agent something analogous to intrinsic motivation can be done by devising a reward signal 
that helps an agent learn widely useful skills, including skills that aid the learning process itself. Reward 
signals can depend on such things as a general ability to cause changes in the environment, assessments
of general progress in learning, or other measures that do not depend on a goal of performing a specific
task. An example is the “bonus reward” described in Section 8.3. Instead of being tied to a specific task, 
this reward signal encourages exploration in general, which benefits the learning of many specific tasks. 
Another example is the proposal by Schmidhuber (1991a, b) for how something like curiosity would
result if reward signals were a function of how quickly an agent’s environment model was improving 
in predicting state transitions. Many preliminary studies of such computational curiosity have been 
conducted and are exciting topics of ongoing research. 
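
For concreteness, the bonus reward of Section 8.3 adds to the modeled reward a term proportional to the square root of the time since a transition was last tried. A minimal sketch follows; the function name and the default value of `kappa` are hypothetical.

```python
import math


def bonus_reward(reward, steps_since_tried, kappa=0.01):
    """Reward plus an exploration bonus that grows with time since the last try."""
    return reward + kappa * math.sqrt(steps_since_tried)
```

The bonus is zero for a transition that was just tried and grows slowly afterwards, so long-neglected parts of the environment gradually become worth revisiting regardless of the task at hand.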


17.5 Remaining Issues 

In this book we have presented the foundations of a reinforcement learning approach to artificial intelligence.
Roughly speaking, that approach is based on model-free and model-based methods working
together, as in the Dyna architecture of Chapter 8, combined with function approximation as developed 
in Part II. The focus has been on online and incremental algorithms, which we see as fundamental even 
to model-based methods, and on how these can be applied in off-policy training situations. The full 
rationale for the latter has been presented only in this last chapter. That is, we have all along presented
off-policy learning as an appealing way to deal with the explore/exploit dilemma, but only in this
chapter have we talked about learning about many diverse auxiliary tasks simultaneously with GVFs, 
and about understanding the world hierarchically in terms of temporally-abstract option models, both 
of which seem to ineluctably involve off-policy learning. Much remains to be worked out, as we have 
indicated throughout the book and as evidenced by the directions for additional research discussed in 
this chapter. But suppose we are generous and grant the broad outlines of everything that we have 
done in the book and everything that has been outlined in this chapter. What would remain even after
that? Of course, we can’t know for sure what will be required, but we can make some guesses. In this 
section we highlight four further issues which it seems to us will still need to be addressed by future 
research. 

First, we still need powerful parametric function approximation methods that work well in fully 
incremental and online settings. Methods based on deep learning and artificial neural networks are 
a major step in this direction, but rely on batch training with large data sets, extensive off-line self 
play, or learning asynchronously from multiple simultaneous streams of agent-environment interaction.
These and other techniques are ways of working around a basic limitation of today’s deep learning 
methods, which struggle to learn rapidly in the incremental, online settings that are most natural for 
reinforcement learning and that we have emphasized in this book. The problem is sometimes
described as one of “correlated data” or “catastrophic interference”. When something new is learned 
it tends to replace what has previously been learned rather than adding to it, such that the benefit 
of the older learning is lost. Techniques such as “replay buffers” are often used to retain and replay 
old data so that its benefits are not permanently lost. An honest assessment has to be that current 
deep learning methods just don’t learn well online. We don’t see any reason why they couldn’t, but the 
learning algorithms to do this have not yet been devised, and the bulk of current research is directed 
toward working around this limitation of current algorithms rather than to removing it. 

Second, and perhaps closely related, we still need methods for learning features such that subsequent 
learning generalizes well. This issue is an instance of a general problem variously called “representation 
learning,” “constructive induction,” and “meta-learning”: how can we use experience not just to learn
a given desired function, but to learn inductive biases such that future learning generalizes better and
is thus faster? This is an old problem, dating back to the origins of artificial intelligence and pattern 
recognition in the 1950s and 1960s. (Some would claim that deep learning solves this problem, but we 
consider it still unsolved.) The age of the problem should give one pause. Perhaps there is no solution? But it is 
just as likely that the time has not previously been ripe for a solution to be found and shown effective. 
Today machine learning is conducted at a far larger scale and the benefits of a good representation 
learning method are potentially much more apparent. We note that a new annual conference—the 
International Conference on Learning Representations—has been exploring this and related topics every 
year since 2013. It is also new to explore representation learning in a reinforcement learning context. 
Reinforcement learning brings some new possibilities to this old issue, such as the auxiliary tasks 
discussed in Section 17.1. In reinforcement learning, the problem of representation learning can be 
identified with the problem of learning the state-update function discussed in Section 17.3. 

Third, we still need scalable methods for planning with learned models. Planning methods have 
proven extremely effective in applications such as AlphaGo Zero and computer chess in which the 
model of the environment is known from the rules of the game or can otherwise be designed in by 
people. But cases of full model-based reinforcement learning, in which the environmental model is 
learned from data and then used for planning, are rare. The Dyna system described in Chapter 8 is one 
example, but as described there and in most subsequent work it used a tabular model without function 
approximation of any sort, which greatly limits its applicability. There have been only a few studies 
with learned linear models, and even fewer that have also tried to incorporate temporally abstract 
models using options as discussed in Section 17.2. 
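
To make the limitation of a tabular model concrete, here is a minimal Python sketch in the spirit of tabular Dyna. It assumes a deterministic environment, a model stored as a dictionary from (state, action) pairs to (reward, next state), and one-step Q-learning updates applied to simulated experience. The function name, parameter values, and data layout are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def dyna_q_planning(q, model, actions, n_steps, alpha=0.1, gamma=0.95, rng=random):
    """Apply n_steps one-step Q-learning updates to simulated experience.

    q      : mapping (state, action) -> value estimate (defaultdict(float))
    model  : dict (state, action) -> (reward, next_state), filled in from
             real experience; it covers only pairs actually visited
    """
    seen = list(model.keys())
    if not seen:
        return  # nothing has been experienced yet, so nothing can be planned
    for _ in range(n_steps):
        # Pick a previously observed state-action pair at random.
        s, a = rng.choice(seen)
        # Query the tabular model for the remembered outcome.
        r, s_next = model[(s, a)]
        best_next = max(q[(s_next, b)] for b in actions)
        q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])


# Usage: one remembered transition from state 0 to state 1 with reward 1.
q = defaultdict(float)
model = {(0, "right"): (1.0, 1)}
dyna_q_planning(q, model, ["left", "right"], n_steps=2, alpha=0.5, gamma=0.9)
```

Because the model is a lookup table, it says nothing about state-action pairs that have never been visited. That is the sense in which the absence of function approximation limits applicability.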

These limitations are a problem because they greatly limit the effectiveness of planning. In particular, 
model making needs to be selective because the contents of the model strongly affect planning efficiency. 
If the model is focused on the key consequences of the most important possible options, then planning 
can be efficient and rapid, but if the model details the unimportant consequences of options that are 
unlikely to be chosen, then planning may be almost useless. Environmental models must be constructed 
judiciously in both their states and dynamics so as to optimize the planning process. The various parts 
of the model will have to be continually monitored for the degree to which they contribute to or detract 
from planning efficiency. The field has not yet come to grips with this complex of issues or designed 
model-learning methods that take into account their implications. To make a good model that supports 
planning is analogous to obtaining a true understanding of the environment that enables reasoning to 
obtain a goal. As such it would be a significant milestone in artificial intelligence. 

The fourth and final issue that strikes us as needing to be addressed in future research is that 
of automating the choice of subproblems on which an agent works and which it uses to structure its 
developing mind. In machine learning, designers are used to setting the problems or tasks for the 
learning agent. Because these tasks are fixed, we build them into the code for the learning algorithm. 
However, looking ahead we will want the agent to make its own choices about what tasks to work on. 
These tasks may be like the auxiliary tasks or the GVFs discussed in Section 17.1. In forming a GVF, 
for example, what should the cumulant, the policy, and the termination function be? The current state 
of the art is to select these manually, but far greater power and generality would come from making 
these task choices automatically, particularly when they are from things previously constructed by the 
agent as a result of representation learning or previous subproblems. If GVF design is automated, 
then the design choices themselves will have to be explicitly represented. Rather than the task choices 
being in the mind of the designer and built into the code, they will have to be in the machine itself in 
such a way that they can be set and changed, monitored, filtered, and searched among automatically. 
Tasks could then be built hierarchically upon one another much like features are in a neural network. 
The tasks are the questions and the contents of the neural network the answers to those questions. 
We expect there will need to be a full hierarchy of questions to match the hierarchy of the answers in 
modern deep learning. 
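
One way to picture task choices that live "in the machine itself" is to represent each GVF question as an ordinary data object. The Python sketch below is only an illustration under stated assumptions: the `GVFSpec` structure and `sampled_return` helper are hypothetical names, the cumulant is assumed to be a function of the next state, and the termination function is assumed to return a continuation factor between 0 and 1 that discounts later cumulants.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Sequence

State = Hashable
Action = Hashable


@dataclass(frozen=True)
class GVFSpec:
    """An explicitly represented GVF question (hypothetical structure)."""
    name: str
    cumulant: Callable[[State], float]      # signal to be accumulated
    policy: Callable[[State], Action]       # target policy of the question
    termination: Callable[[State], float]   # continuation factor in [0, 1]


def sampled_return(spec: GVFSpec, states: Sequence[State]) -> float:
    """Accumulate the cumulant along a state sequence S_t, S_{t+1}, ...

    Each later cumulant is weighted by the product of the continuation
    factors of the states visited before it (an assumed return form).
    """
    total, discount = 0.0, 1.0
    for s_next in states[1:]:
        total += discount * spec.cumulant(s_next)
        discount *= spec.termination(s_next)
    return total


# Usage: a question asking whether "goal" is reached soon.
reach_goal = GVFSpec(
    name="reach-goal",
    cumulant=lambda s: 1.0 if s == "goal" else 0.0,
    policy=lambda s: "forward",
    termination=lambda s: 0.0 if s == "goal" else 0.5,
)
value = sampled_return(reach_goal, ["a", "b", "goal", "c"])
```

Because each question is a plain object, a collection of them can be stored, filtered, and searched by a program rather than being fixed in the learning code by the designer.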

17.6 Reinforcement Learning and the Future of Artificial Intelligence 


When we were writing the first edition of this book in the mid-1990s, artificial intelligence was making 
significant progress and was having an impact on society, though it was mostly still the promise of 
artificial intelligence that was inspiring developments. Machine learning was part of that outlook, but 
it had not yet become indispensable to artificial intelligence. By today that promise has transitioned 
to applications that are changing the lives of millions of people, and machine learning has come into 
its own as a key technology. As we write this second edition, some of the most remarkable developments 
in artificial intelligence have involved reinforcement learning, most notably “deep reinforcement 
learning”—reinforcement learning with function approximation by deep neural networks. We are at 
the beginning of a wave of real-world applications of artificial intelligence, many of which will include 
reinforcement learning, deep and otherwise, that will impact our lives in ways that are hard to predict. 

But an abundance of successful real-world applications does not mean that true artificial intelligence 
has arrived. Despite great progress in many areas, the gulf between artificial intelligence and 
the intelligence of humans, and even of other animals, remains great. Superhuman performance can be 
achieved in some domains, even formidable domains like Go, but it remains a significant challenge to 
develop systems that are like us in being complete, interactive agents having general adaptability and 
problem-solving skills, emotional sophistication, creativity, and the ability to learn quickly from experience. 
With its focus on learning by interacting with dynamic environments, reinforcement learning, 
as it develops over the future, will be a critical component of agents with these abilities. 

Reinforcement learning’s connections to psychology and neuroscience (Chapters 14 and 15) underscore 
its relevance to another longstanding goal of artificial intelligence: shedding light on fundamental 
questions about the mind and how it emerges from the brain. Reinforcement learning theory is already 
contributing to our understanding of natural reward, motivation, and decision-making systems, 
understanding that can contribute to improving human abilities to learn, to remain motivated, and to 
make decisions. There is also good reason to believe that through its links to computational psychiatry, 
reinforcement learning theory will contribute to methods for treating mental disorders, including drug 
abuse and addiction. 

Another contribution that reinforcement learning can make over the future is as an aid to human 
decision making. Policies derived by reinforcement learning in simulated environments can advise 
human decision makers in such areas as education, healthcare, transportation, energy, and public-sector 
resource allocation. Particularly relevant is the key feature of reinforcement learning that it takes long-term 
consequences of decisions into account. This is very clear in games like backgammon and Go, 
where some of the most impressive results of reinforcement learning have been demonstrated, but it 
is also a property of many high-stakes decisions that affect our lives and our planet. Reinforcement 
learning follows related methods for advising human decision making that have been developed in the 
past by decision analysts in many disciplines. With advanced function approximation methods and 
massive computational power, reinforcement learning methods have the potential to overcome some of 
the difficulties of scaling up traditional decision-support methods to larger and more complex problems. 

The rapid pace of advances in artificial intelligence has led to warnings that artificial intelligence 
poses serious threats to our societies, even to humanity itself. The renowned scientist and artificial 
intelligence pioneer Herbert Simon anticipated the warnings we are hearing today in a presentation at 
the Earthware Symposium at CMU in 2000 (Simon, 2000). He spoke of the eternal conflict between the 
promise and perils of any new knowledge, reminding us of the Greek myths of Prometheus, the hero 
of modern science, who stole fire from the gods for the benefit of mankind, and Pandora, whose box 
could be opened by a small and innocent action to release untold perils on the world. While accepting 
that this conflict is inevitable, Simon urged us to recognize that as designers of our future and not 
simply as spectators, the decisions we make can tilt the scale in Prometheus’ favor. This is certainly 
true for reinforcement learning, which can benefit society but can also produce undesirable outcomes if 
it is carelessly deployed. Thus, the safety of artificial intelligence applications involving reinforcement 
learning is a topic that deserves careful attention. 

A reinforcement learning agent can learn by interacting with either the real world or with a simulation 
of some piece of the real world, or by a mixture of these two sources of experience. Simulators provide 
safe environments in which an agent can explore and learn without risking real damage to itself or to 
its environment. In most current applications, policies are learned from simulated experience instead 
of direct interaction with the real world. In addition to avoiding undesirable real-world consequences, 
learning from simulated experience can make virtually unlimited data available for learning, generally 
at less cost than needed to obtain real experience, and since simulations typically run much faster than 
real time, learning can often occur more quickly than if it relied on real experience. 

Nevertheless, the full potential of reinforcement learning requires reinforcement learning agents to 
be embedded into the flow of real-world experience, where they act, explore, and learn in our world, 
and not just in their worlds. After all, reinforcement learning algorithms—at least those upon which 
we focus in this book—are designed to learn online, and they emulate many aspects of how animals are 
able to survive in nonstationary and hostile environments. Embedding reinforcement learning agents 
in the real world can be transformative in realizing the promises of artificial intelligence to amplify and 
extend human abilities. 

A major reason for wanting a reinforcement learning agent to act and learn in the real world is that it 
is often difficult, sometimes impossible, to simulate real-world experience with enough fidelity to make 
the resulting policies, whether derived by reinforcement learning or by other methods, work well—and 
safely—when directing real actions. This is especially true for environments whose dynamics depend on 
the behavior of humans, such as in education, healthcare, transportation, and public policy, domains 
that can surely benefit from improved decision making. However, it is for real-world embedded agents 
that warnings about potential dangers of artificial intelligence need to be heeded. 

Reinforcement learning is a collection of optimization methods, so it inherits the pluses and minuses 
of all optimization methods. On the minus side is the problem we mentioned at several places above: 
how do you devise objective functions, or reward signals in the case of reinforcement learning, so that 
optimization produces the desired results while avoiding undesirable results? This is hardly a new 
problem with reinforcement learning; recognition of it has a long history in literature and engineering. 
The founder of cybernetics, Norbert Wiener, for one, warned of this more than half a century ago by 
relating the supernatural story of “The Monkey’s Paw” (Wiener, 1964): wishes are granted but come 
with unacceptable cost. The problem has also been discussed at length in a modern context by Nick 
Bostrom (2014). Anyone having experience with reinforcement learning has likely seen their systems 
discover unexpected ways to obtain a lot of reward. Sometimes the unexpected behavior is good: it 
solves a problem in a nice new way. In other instances, what the agent learns violates considerations 
that the system designer may never have thought about. Careful design of reward signals is essential if 
an agent is to act in the real world with no opportunity for human vetting of its actions or means to 
easily interrupt its behavior. 

Despite the possibility of unintended negative consequences, optimization has been used for hundreds 
of years by engineers, architects, and others whose designs have positively impacted the world. Many 
approaches have been developed to mitigate the risk of optimization, such as adding hard and soft 
constraints, restricting optimization to robust and risk-sensitive policies, and optimizing with multiple 
objective functions. Some of these have been adapted to reinforcement learning. We owe much that 
is good in our environment to the application of optimization methods. Still, the problem of ensuring 
that a reinforcement learning agent’s goal is attuned to our own remains a challenge. 

Another challenge if reinforcement learning agents are to act and learn in the real world is not just 
about what they might eventually learn, but about how they will behave while they are learning. How 
do you make sure that an agent gets enough experience to learn a high-performing policy, all the while 
not harming its environment, other agents, or itself (or more realistically, while keeping the probability 
of harm acceptably low)? This problem is also not novel or unique to reinforcement learning. Risk 
management and mitigation for embedded reinforcement learning is similar to what control engineers 
have had to confront from the beginning of using automatic control in situations where a controller’s 
behavior can have unacceptable, possibly catastrophic, consequences, as in the control of an aircraft 
or a delicate chemical process. Control applications rely on careful system modeling, model validation, 
and extensive testing, and there is a highly developed body of theory aimed at ensuring convergence 
and stability of adaptive controllers designed for use when the dynamics of the system to be controlled 
are not fully known. Theoretical guarantees are never iron-clad because they depend on the validity of 
the assumptions underlying the mathematics, but without this theory, combined with risk-management 
and mitigation practices, automatic control—adaptive and otherwise—would not be as beneficial as it 
is today in improving the quality, efficiency, and cost-effectiveness of processes on which we have come 
to rely. Some of this theory has been adapted to reinforcement learning to help prevent unwanted 
behavior during, and after, learning, but many future applications of reinforcement learning are likely 
to be in domains that are less constrained than those to which control theory and practice readily 
apply. Developing methods to make it acceptably safe to fully embed reinforcement learning agents 
into physical environments is one of the most pressing areas for future research. 

In closing, we return to Simon’s call for us to recognize that we are designers of our future and 
not simply spectators. By decisions we make as individuals, and by the influence we can exert on 
how our societies are governed, we can work toward ensuring that the benefits made possible by a 
new technology outweigh the harm it can cause. There is ample opportunity to do this in the case 
of reinforcement learning, which can help improve the quality, fairness, and sustainability of life on 
our planet, but which can also release new perils. A threat already here is the displacement of jobs 
caused by applications of artificial intelligence. Still, there are good reasons to believe that the benefits 
of artificial intelligence can outweigh the disruption it causes. As to safety, hazards possible with 
reinforcement learning are not completely different from those that have been managed successfully for 
related applications of optimization and control methods. As reinforcement learning moves out into 
the real world in future applications, developers have an obligation to follow best practices that have 
evolved for similar technologies, while at the same time extending them to make sure that Prometheus 
keeps the upper hand. 


Bibliographical and Historical Remarks 

17.1 General value functions were first explicitly identified by Sutton and colleagues (Sutton, 1995; 
Sutton et al., 2011; Modayil, White and Sutton, 2013). Ring (in preparation) developed an 
extensive thought experiment with GVFs (“forecasts”) that has been influential despite not yet 
having been published. 

The first demonstrations of multi-headed learning in reinforcement learning were by Jaderberg 
et al. (2017). Bellemare, Dabney and Munos (2017) showed that predicting more things about 
the distribution of reward could significantly speed learning to optimize its expectation, an 
instance of auxiliary tasks. Many others have since taken up this line of research. 

The view of classical conditioning as learned predictions together with built-in, reflexive reactions 
to the predictions has not to our knowledge been clearly articulated in the psychological 
literature. Modayil and Sutton (2014) describe it as an approach to the engineering of robots 
and other agents, calling it “Pavlovian control” to allude to its roots in classical conditioning. 

17.2 The formalization of temporally abstract courses of action as options was introduced by Sutton, 
Precup, and Singh (1999), building on prior work by Parr (1998) and Sutton (1995a), and on 
classical work on Semi-MDPs (e.g., see Puterman, 1994). Precup’s (2000) PhD thesis developed 
option ideas fully. An important limitation of these early works is that they treated either the 
tabular case or the on-policy case. The general case of intra-option learning involves off-policy 
learning, which could not be done reliably with function approximation at that time. Although 
now we have a variety of stable off-policy learning methods using function approximation, their 
combination with option ideas had not been significantly explored at the time of publication of 
this book. 

Using GVFs to implement option models has not previously been described. Our presentation 
uses the trick introduced by Modayil, White and Sutton (2014) for predicting signals at the 
termination of policies. 

Among the few works that have learned option models with function approximation are those 
by Bacon, Harb, and Precup (2017). 

The extension of options and option models to the average-reward setting has not yet been 
developed in the literature. 

17.3 For a good intuitive discussion of the system-theoretic concept of state, see Minsky (1967). A 
good presentation of the POMDP approach is given by Monahan (1982). PSRs and tests were 
introduced by Littman, Sutton and Singh (2002). OOMs were introduced by Jaeger (1997, 1998, 
2000). Sequential Systems, which unify PSRs, OOMs, and many other works, were introduced 
in the PhD thesis of Michael Thon (2017; Thon and Jaeger, 2015). 

The theory of reinforcement learning with a non-Markov state representation was developed 
explicitly by Singh, Jaakkola, and Jordan (1994). 

17.5 The problem of catastrophic interference in artificial neural networks was developed by McCloskey 
and Cohen (1989), Ratcliff (1990), and French (1999). The idea of a replay buffer was 
introduced by Lin (1992) and used prominently in deep learning in the Atari game playing 
system (Section 16.5, Mnih et al., 2013, 2015). 

Minsky (1961) was one of the first to identify the problem of representation learning. 

Among the few works to consider planning with learned, approximate models are those by 
Kuvayev and Sutton (1998), Sutton, Szepesvari, Geramifard, and Bowling (2008), and Nouri 
and Littman (2009). 

The need to be selective in model construction to avoid slowing planning is well known in 
artificial intelligence. Some of the classic work is by Minton (1990) and Tambe, Newell, and 
Rosenbloom (1990). Hauskrecht, Meuleau, Boutilier, Kaelbling, and Dean (1998) showed this 
effect in MDPs with deterministic options. 